Date and time: Tuesday, 28 January, 11:00 – 12:00
Speaker: Maxime Méloux, MSc (Laboratoire d'Informatique de Grenoble)
Abstract: As AI systems are increasingly deployed in high-stakes real-world applications, ensuring their interpretability has become critical. Mechanistic Interpretability (MI) is a promising approach that aims to reverse-engineer neural networks to extract simple, human-understandable algorithms embedded in the neural structure that explain the model’s behavior. In our work, we investigate a fundamental concern with concrete formalizations of MI: do current criteria guarantee the non-existence of multiple, incompatible explanations? We identify two broad strategies for producing MI explanations, and systematically test the identifiability of both strategies using simple tasks and models that allow complete enumeration of candidate explanations. Our experiments reveal that current criteria suffer from identifiability issues at every stage, suggesting that these criteria are too permissive and need refinement to ensure identifiability. We discuss the generalization of our results to larger models and potential fixes based on stricter criteria.
Location: B1-7/8 and online