We extend the logit lens to Vision Transformers, decoding intermediate representations directly into class embedding space. This reveals how predictions form layer by layer and makes ViT internals directly readable.
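As a rough illustration of the logit-lens idea described above, here is a minimal pure-Python sketch. It assumes that each layer's CLS-token state is decoded by reusing the model's final normalization and classification head, whose weight rows are treated as the class embeddings. The helper names `layer_norm`, `class_logits`, and `logit_lens` are hypothetical, and the normalization deliberately omits the learned scale and shift a real ViT would apply.

```python
import math
from typing import List, Sequence

Vector = List[float]


def layer_norm(x: Sequence[float], eps: float = 1e-6) -> Vector:
    """Zero-mean, unit-variance normalization (no learned affine params: a simplifying assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def class_logits(h: Sequence[float],
                 head: Sequence[Sequence[float]],
                 bias: Sequence[float]) -> Vector:
    """Project a hidden vector onto class space: one dot product per class-embedding row."""
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(head, bias)]


def logit_lens(cls_states: Sequence[Sequence[float]],
               head: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[int]:
    """For each layer's CLS state, apply the final norm and head, and return the top class index."""
    preds = []
    for h in cls_states:
        z = class_logits(layer_norm(h), head, bias)
        preds.append(max(range(len(z)), key=z.__getitem__))
    return preds


# Toy example: two layers, 3-dim hidden states, 2 classes.
states = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
head = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(logit_lens(states, head, [0.0, 0.0]))  # [0, 1]
```

Reading the per-layer predictions side by side shows at which depth the eventual class first becomes the top-scoring one; in a real model the states, head, and bias would come from the trained network rather than toy values.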
Shown a single isolated object, can a vision–language model infer the scene around it? We probe the internal mechanisms behind this contextual leap and reveal the learned associations that link objects to their typical contexts.
Where does a vision–language model "know" an object is? Through causal probing and targeted ablations, we identify the components that carry spatial grounding and trace how localization information flows through the network.
Drawing on lessons from cognitive neuroscience, we propose a conceptual framework for building mechanistic explanations of AI systems, offering concrete strategies for moving interpretability beyond ad-hoc analyses toward principled theory.