Back to Blog

Opening the Black Box of LLMs: A Description of My Current Research

Published on October 3, 2025

When people talk about artificial intelligence, they often describe it as a black box. We type in a prompt, and the model produces something surprisingly fluent. But very few people, even researchers, can explain how it reached that answer.

That mystery is what drew me to Mechanistic Interpretability, a field that tries to make sense of what’s going on inside these systems. Instead of treating models as magical oracles, we try to trace the thoughts of the language model. It is like peeking into their brain, probing it, and seeing how they react: a kind of neuroscience for AI, maybe.

My research focuses on opening up that black box using a new approach: attribution graphs. Think of them as detailed maps of how information flows inside a language model when it responds to a prompt. Each graph shows not just the words we type in and the words the model produces, but the hidden chain of signals, decisions, and pathways that connect them.

By collecting thousands of these maps, I look for recurring motifs—small, repeatable structures in the graph that show up whenever the model performs a certain behavior. For example, if a model consistently recalls facts across multiple steps or refuses a harmful request, there may be a distinct subsystem that ‘lights up’ to make that happen. The next step is testing whether these motifs really matter. Using targeted interventions, we can shut off parts of the graph and see if the behavior disappears, or strengthen a pathway and watch the response change. If the motif is stable and causal, then we’ve learned something remarkable: a hidden building block of reasoning inside the AI.
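To make the loop above concrete, here is a minimal toy sketch in Python. It is not the actual research code: the graph representation (a dict mapping directed edges to attribution weights), the names `find_motifs`, `propagate`, and `ablate`, and the simple edge-counting stand-in for real subgraph-motif mining are all illustrative assumptions.

```python
from collections import Counter

def find_motifs(graphs, min_support=0.8):
    """Find edges that recur across many attribution graphs.

    Each graph is a dict mapping (src, dst) edges to attribution
    weights. An edge counts as a motif candidate if it appears in at
    least `min_support` fraction of the graphs. (A toy stand-in for
    real subgraph-motif mining.)
    """
    counts = Counter()
    for g in graphs:
        for edge in g:
            counts[edge] += 1
    threshold = min_support * len(graphs)
    return {edge for edge, c in counts.items() if c >= threshold}

def propagate(graph, source="prompt", sink="output"):
    """Sum attribution over all source->sink paths (assumes a DAG)."""
    def strength(node):
        if node == sink:
            return 1.0
        return sum(w * strength(dst)
                   for (src, dst), w in graph.items() if src == node)
    return strength(source)

def ablate(graph, edge):
    """Targeted intervention: zero out one edge and return a copy."""
    g = dict(graph)
    if edge in g:
        g[edge] = 0.0
    return g

# Toy attribution graph for a "refusal" behavior.
g = {("prompt", "refusal"): 0.9,
     ("refusal", "output"): 0.8,
     ("prompt", "output"): 0.1}

before = propagate(g)                          # 0.9*0.8 + 0.1 = 0.82
after = propagate(ablate(g, ("refusal", "output")))  # only 0.1 remains
print(before, after)
```

If ablating the candidate pathway collapses the behavior's attribution (here, 0.82 dropping to 0.1), that is the kind of evidence that the motif is causal rather than incidental.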

Why does this matter outside the lab? It can help stop an AI from behaving like "MechaHitler," as we saw with Grok a few months ago. Understanding the subsystems that cause bias, hallucinations, and harmful traits will help us make AI models more aligned with the public's best interests. In a world where AI is increasingly shaping online conversations and moderating content, identifying and cataloging these motifs gives us tools for transparency and accountability.

Personally, this work is about more than the algorithms and graphs. It’s about trust. If the future depends on AI, then we deserve to know not just what these systems say, but why they say it. My research brings us a step closer to that transparency by turning the black box into something we can finally open, study, and, most importantly, hold accountable.

Have any thoughts or feedback for me? Feel free to reach out on LinkedIn and let's chat!