Researchers led by Dr. Haggai Maron at the Technion, with collaborators from other universities and NVIDIA, have developed tools that inspect large language models for signs of hallucinations, memorized training data and other unreliable outputs. The approach shifts interpretability from fully explaining model behavior toward monitoring, in real time, the signals a model produces as it runs, such as activations, attention maps and output probability distributions.
One system, ACT-ViT, presented at NeurIPS 2025, analyzes activation patterns across all layers and tokens, treating them as a multidimensional grid that a Vision Transformer can process. It outperformed standard probing methods and performed strongly when adapted to a previously unseen model with the main system kept fixed.
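To make the grid idea concrete, here is a minimal sketch of how an activation tensor could be cut into image-style patches before being handed to a Vision Transformer. It is an illustration under stated assumptions, not the ACT-ViT implementation: the function name `patchify_activations`, the patch sizes, the zero-padding and the toy tensor shape are all hypothetical choices made for this example.

```python
import numpy as np

def patchify_activations(acts, patch_layers=2, patch_tokens=4):
    """Split a (layers, tokens, hidden) activation tensor into ViT-style patches.

    Assumption for this sketch: layers play the role of image height,
    tokens the role of width, and the hidden dimension the role of channels.
    """
    n_layers, n_tokens, hidden = acts.shape
    # Zero-pad so both grid axes divide evenly by the patch size.
    pad_l = (-n_layers) % patch_layers
    pad_t = (-n_tokens) % patch_tokens
    acts = np.pad(acts, ((0, pad_l), (0, pad_t), (0, 0)))
    grid_l = acts.shape[0] // patch_layers
    grid_t = acts.shape[1] // patch_tokens
    # (grid_l, pl, grid_t, pt, hidden) -> (grid_l, grid_t, pl, pt, hidden)
    patches = acts.reshape(grid_l, patch_layers, grid_t, patch_tokens, hidden)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flat vector per patch, ready for a linear patch embedding.
    return patches.reshape(grid_l * grid_t, patch_layers * patch_tokens * hidden)

# Hypothetical toy input: 12 layers, 10 tokens, hidden size 8.
rng = np.random.default_rng(0)
acts = rng.normal(size=(12, 10, 8))
patches = patchify_activations(acts)
print(patches.shape)  # (18, 64)
```

Each row of `patches` would then be embedded and passed through transformer layers, as with image patches. Padding the token axis is one simple way to handle responses of different lengths; pooling to a fixed grid size would be another.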
A second method, CHARM, presented at ICLR 2026, represents attention patterns as graphs and uses a graph neural network to predict hallucinations at the token or response level. A third study, presented at AAAI 2026, introduced LOS-Net, which uses output probability distributions to detect hallucinations and data contamination in settings where internal model states are not available. Future work will explore combining activations, attention and output distributions into a broader monitoring system.
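The two input representations described above, attention as a graph and output distributions as features, can be sketched in a few lines. This is a hedged illustration only, not the CHARM or LOS-Net code: the function names, the attention threshold, the choice of top-k probabilities plus entropy as features, and the toy inputs are all assumptions made for this example.

```python
import numpy as np

def attention_to_edges(attn, threshold=0.1):
    """Turn one head's (tokens, tokens) attention map into a weighted edge list.

    Assumption: attn[t, s] is the weight query token t places on key token s.
    Each token is a node; an edge s -> t is kept when that weight exceeds
    the threshold.
    """
    targets, sources = np.nonzero(attn > threshold)
    return [(int(s), int(t), float(attn[t, s])) for t, s in zip(targets, sources)]

def output_distribution_features(probs, k=3):
    """Summarize per-token next-token distributions of shape (tokens, vocab)."""
    # Top-k probabilities per token, sorted from largest to smallest.
    topk = np.sort(probs, axis=-1)[:, ::-1][:, :k]
    # Shannon entropy in nats; the small epsilon guards against log(0).
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1, keepdims=True)
    return np.concatenate([topk, entropy], axis=-1)

# Hypothetical causal attention map for a 3-token sequence.
attn = np.array([
    [1.00, 0.00, 0.00],
    [0.70, 0.30, 0.00],
    [0.05, 0.15, 0.80],
])
edges = attention_to_edges(attn)
print(len(edges))  # 5

# Hypothetical output distributions over a 4-word vocabulary.
probs = np.array([
    [0.70, 0.20, 0.10, 0.00],   # fairly confident prediction
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain prediction
])
features = output_distribution_features(probs)
print(features.shape)  # (2, 4)
```

In a fuller pipeline, the edge list would feed a graph neural network and the feature rows would feed a small sequence classifier. The second function needs only the probabilities a model returns, which is why this kind of signal remains usable when internal states are hidden.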
