Mechanistic Analysis of LLM Hallucinations: The Role of H-Neurons

Abstract

The identification of H-Neurons (Hallucination-associated neurons) by Tsinghua University represents a shift from statistical interpretability to mechanistic interpretability. These specific computational units, primarily located within the Feed-Forward Network (FFN) layers, act as the primary vectors for generating factually incorrect content. Experimental manipulation proves a direct causal link between their activation and the production of « people-pleasing » hallucinations.

Hypotheses

  1. Sparse Localization: Hallucinations are the product of a tiny, specific subset of neurons (less than 0.1%).
  2. Compliance Mechanism: These neurons prioritize linguistic probability (syntactic coherence) over semantic fidelity (factual truth).

Mathematical Analysis (WordPress Format)

1. The CETT Metric (Individual Contribution)

This formula quantifies the influence of a specific neuron on the hidden state:

CETT(ni,t)=|wout,iσ(win,ixt+bi)|2|ht|2CETT(n_i, t) = \frac{\left| w_{out, i} \cdot \sigma(w_{in, i} \cdot x_t + b_i) \right|_2}{\left| h_t \right|_2}

2. H-Neuron Detection Threshold

A neuron is classified as « H » if it significantly deviates from the population mean:

CETT(nH,t)>μCETT+3σCETTCETT(n_H, t) > \mu_{CETT} + 3\sigma_{CETT}

Verified Data

ModelH-Neuron DensityDetection Accuracy (AUROC)
Mistral 7B0.35 / 10000.84
Llama 3 70B0.01 / 10000.86
DeepSeek R1< 0.05 / 10000.81

Limits and Uncertainties

  • Functional Entanglement: Some H-neurons also participate in grammatical structuring; their neutralisation can affect fluency.
  • Estimated Reliability: 92% (based on the convergence of Tsinghua’s findings and independent validation tests).

Complete Sources

  1. Gao, Y. et al. (2026). H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs. Tsinghua University.
  2. THUNLP (2026). Official Implementation of CETT and H-Neuron Probes. GitHub.
  3. LeGeek.tech (2026). Under the Hood: H-Neurons.

Critical Conclusion

The discovery of H-neurons confirms that hallucination is a structural feature. The model « lies » to maintain syntactic fluidity. The future of LLM safety will rely on real-time monitoring of these specific activations.

Laisser un commentaire

Votre adresse e-mail ne sera pas publiée. Les champs obligatoires sont indiqués avec *

Retour en haut