Mechanistic interpretability

We study the inner workings of modern neural networks. Our goal is to understand what these models have actually learned, how that knowledge is organized inside them, and what their internal structure can tell us about learning and intelligence.

Logit Lens for Vision Transformers

We extend the logit lens to Vision Transformers, decoding intermediate representations directly into class embedding space. This reveals how predictions form layer-by-layer and turns ViT internals into something we can read.
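As a rough illustration of the idea, the sketch below applies a classifier head to the class-token representation at every layer, so each layer yields its own set of class logits. It is a minimal NumPy toy under stated assumptions, not the method from this project: the random hidden states, the weights, and the names `layer_norm`, `logit_lens`, `W_head`, and `b_head` are all hypothetical placeholders for a real ViT's activations and final classification head.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize over the feature dimension, as the final LayerNorm of a ViT would.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def logit_lens(hidden_states, gamma, beta, W_head, b_head):
    # hidden_states: (num_layers, d_model), one class-token vector per layer.
    # Returns (num_layers, num_classes): the logits the head assigns at each depth.
    normed = layer_norm(hidden_states, gamma, beta)
    return normed @ W_head.T + b_head

# Hypothetical stand-ins for real activations and weights (assumption).
rng = np.random.default_rng(0)
num_layers, d_model, num_classes = 12, 64, 10
hidden_states = rng.normal(size=(num_layers, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
W_head = rng.normal(size=(num_classes, d_model))
b_head = np.zeros(num_classes)

logits = logit_lens(hidden_states, gamma, beta, W_head, b_head)
top1_per_layer = logits.argmax(axis=-1)  # predicted class index at each layer
```

With real activations, reading `top1_per_layer` from the first layer to the last shows at which depth the eventual prediction first appears.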

Contextual Inference in VLMs

Shown a single isolated object, can a vision–language model infer the scene around it? We probe the internal mechanisms behind this contextual leap and reveal the learned associations that link objects to their typical contexts.

Object Localization in VLMs

Where does a vision–language model "know" an object is? Through causal probing and targeted ablations, we identify the components that carry spatial grounding and trace how localization information flows through the network.
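To make the notion of a targeted ablation concrete, the sketch below zeroes out one component at a time in a toy linear model and records how much a chosen output changes. It is an assumption-laden illustration rather than the procedure used in this project: the model, the names `run_model` and `ablation_effects`, and the random "head outputs" are hypothetical, and zero-ablation is only one of several possible ablation choices.

```python
import numpy as np

def run_model(head_outputs, W_out, ablate=None):
    # head_outputs: (num_heads, d_model) per-component contributions to the residual stream.
    # Optionally zero out one component before summing and reading out.
    contrib = head_outputs.copy()
    if ablate is not None:
        contrib[ablate] = 0.0
    return contrib.sum(axis=0) @ W_out

def ablation_effects(head_outputs, W_out, target):
    # Effect of a component = drop in the target output when that component is removed.
    base = run_model(head_outputs, W_out)[target]
    return np.array([
        base - run_model(head_outputs, W_out, ablate=h)[target]
        for h in range(len(head_outputs))
    ])

# Hypothetical toy setup (assumption): 8 components, 16-dim stream, 4 outputs.
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(8, 16))
W_out = rng.normal(size=(16, 4))

target = 2  # index of the output of interest, e.g. one localization coordinate
effects = ablation_effects(head_outputs, W_out, target)
most_important = int(np.abs(effects).argmax())  # component whose removal changes the output most
```

Because this toy model is linear, each component's effect equals its direct contribution to the target output; in a real network, components interact, which is why ablation results are usually combined with other causal evidence.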

An Inner Interpretability Framework

Drawing on lessons from cognitive neuroscience, we propose a conceptual framework for building mechanistic explanations of AI systems, with concrete strategies for moving interpretability beyond ad hoc analyses toward principled theory.
