M3I
Reasoning Telemetry
RUN 2026-05-10 DeepSeek-V2-Lite · 16B MoE · Q4_K_M
Held-out evaluation · Pairwise judgment · n=12

Two LLMs audit a third one's reasoning, identify the layers carrying the failures, and train LoRA-DPO on only those layers.

Pipeline validated end-to-end. Targeting layers {6, 7, 8, 10, 13}, identified by Claude Sonnet 4.5 + Gemini 2.5 Pro through cross-observer attribution. Held-out result is null to slightly negative — base wins more clean pairwise judgments than LoRA, with most outcomes tied. Awaiting random-layer control.

BASE WINS      3   25.0% of 12 pairs · both orderings agree
LoRA WINS      1    8.3% of 12 pairs · both orderings agree
TIES / SPLITS  8   66.7% of 12 pairs · position-bias or genuine tie

How M3I works, and where it couldn't reach

01

Where the failures live: a mid-network band

Cross-observer attribution identifies layers {6, 7, 8, 10, 13} as failure-carrying.

[Figure: DeepSeek-V2-Lite layer map highlighting attribution layers 6, 7, 8, 10, 13]

DeepSeek-V2-Lite is 27 layers deep. Layer 0 is a dense MLP; layers 1–26 are MoE blocks. The attribution layers {6, 7, 8, 10, 13} form a non-contiguous cluster around the middle of the network — early-mid through mid-depth, with one gap at layer 9 and a longer gap before 13.

This is consistent with the literature view that mid-layers in transformer-based reasoners do the bulk of the cross-token information mixing where reasoning errors compound. Early layers handle local, lexical features; late layers project back to the vocabulary; the middle is where the actual reasoning gets composed, and where it breaks.

The crucial detail: Claude Sonnet 4.5 and Gemini 2.5 Pro independently flagged the same band. That two-observer agreement is what made these layers training targets, rather than any single auditor's opinion. A single LLM auditor could be biased; agreement between two independent ones is much harder to fake.

Inside each attribution layer. Four MLA attention modules are wrapped: q_proj, kv_a_proj_with_mqa, kv_b_proj, o_proj. The MoE gate was a target too — but PEFT refused to wrap it. See section 02.
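
A minimal sketch of that targeting in PEFT (the module names and layer indices come from above; rank, alpha, and dropout are assumptions, not the run's actual values):

```python
# Hypothetical LoRA config reproducing the layer/module targeting described above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                   # assumed rank
    lora_alpha=32,          # assumed scaling
    lora_dropout=0.05,      # assumed
    target_modules=["q_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"],
    layers_to_transform=[6, 7, 8, 10, 13],   # the attribution layers
    task_type="CAUSAL_LM",
)
```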

02

Why the MoE gate refused to be wrapped

The single most important architectural detail in this whole project.

[Figure: MoE gating mechanism: a token enters the gate, the gate scores 64 experts, the top 6 activate, the others stay idle]

Mixture-of-Experts intuition. A token enters the MoE block. Instead of running through all available experts (which would be wasteful — DeepSeek-V2-Lite has 64 routed experts per layer), the gate scores all 64 and picks only the top 6 to actually compute on this particular token. The other 58 experts sit idle for this token. The outputs of the 6 chosen experts are weighted-summed by their gate scores, and two "shared" experts additionally run on every token, outside the routing.
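
As a paraphrase (illustrative PyTorch, not DeepSeek's actual implementation; shapes and names are assumptions), the routing step looks like this:

```python
# Paraphrased top-k MoE routing, written for clarity rather than speed.
import torch

def moe_block(x, gate_w, experts, shared, k=6):
    """x: [tokens, hidden] · gate_w: [hidden, 64] · experts/shared: lists of MLPs."""
    probs = torch.softmax(x @ gate_w, dim=-1)       # score all 64 experts
    weights, idx = probs.topk(k, dim=-1)            # keep only the top 6 per token
    out = torch.zeros_like(x)
    for t in range(x.size(0)):                      # per-token loop for clarity
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])          # only the chosen 6 compute
    for s in shared:                                # the 2 shared experts always run
        out = out + s(x)
    return out
```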

Sparse computation is the whole point. With 64 routed experts but only 6 active per token, each forward pass through this MoE block costs about (6 + 2) / 66 ≈ 12% of what it would cost if every expert had to compute. That's how DeepSeek-V2-Lite has 16B total parameters but only 2.4B active — most of the model is "asleep" for any given token. The gate is what enables this. It's also what specializes the experts: over training, different experts learn to handle different kinds of tokens (math syntax, code identifiers, common English, domain vocabulary) and the gate learns which experts to wake up for each input.

Why the gate isn't an nn.Linear. The label "Linear → softmax → top-6" is a pipeline, not a layer. The actual MoEGate module wraps a routing Linear (64 logits), a softmax over those logits, top-k selection (sort and take the top 6), a normalization step on the kept weights, and an auxiliary load-balancing loss term. PEFT 0.13+ checks the parent module's type, sees MoEGate rather than nn.Linear, and refuses to wrap it.
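
A simplified sketch of that composite shape (approximate, reconstructed from the description above rather than from the real modeling code):

```python
# Why PEFT balks: the routing weights live in a composite MoEGate module,
# not a standalone nn.Linear that PEFT knows how to wrap.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEGate(nn.Module):                       # approximate, not the real source
    def __init__(self, hidden, n_experts=64, top_k=6):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_experts, hidden))  # raw Parameter
        self.top_k = top_k

    def forward(self, x):
        probs = F.linear(x, self.weight).softmax(dim=-1)  # 64 logits -> softmax
        w, idx = probs.topk(self.top_k, dim=-1)           # top-6 selection
        w = w / w.sum(dim=-1, keepdim=True)               # renormalize kept weights
        return idx, w   # the real module also computes an aux load-balancing loss
```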

What we lost. Conceptually, the gate is where expert selection happens — the most natural place to teach the model "for crypto reasoning, prefer the experts that handle this domain." By being unable to wrap it, our LoRA can only modulate how tokens attend to each other, not which specialist experts they activate. The behavioral signature in the eval — "decisive but wrong" — is consistent with that: attention-only LoRA can change reasoning paths to be more committed without redirecting the underlying expert routing toward domain-appropriate specialists.


03

Two independent selectors at different stages

15.7B → 2.4B → 1.3M. The model is selecting parameters twice, for different reasons.

[Figure: nested boxes: 15.7B total parameters → 2.4B active per token via MoE gating → 1.3M trainable via LoRA]

The model has 15.7B parameters on disk. Per-token, MoE routing selects only ~2.4B (~15%) to actually compute on. Of that, PEFT exposes only 1.3M trainable parameters via LoRA — about 0.008% of the full model.
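
Standard PEFT calls make the innermost selection visible; the expected output below is a ballpark from the report's numbers, not a captured log:

```python
# Count the LoRA slice of the nested selection. trust_remote_code is
# required for DeepSeek-V2's custom modeling code.
from transformers import AutoModelForCausalLM
from peft import get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True
)
model = get_peft_model(base, lora_config)   # lora_config from the section-01 sketch
model.print_trainable_parameters()
# expected ballpark: ~1.3M trainable of ~15.7B total ≈ 0.008%
```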

Two independent selectors operate at different times. The MoE gate decides which experts wake up at inference time, per token, and changes its decision every token. PEFT decides which projections accept gradient at training time, frozen for everything else, and never changes that decision once training starts. They don't know about each other.

The sample-efficiency story. 1.3M trainable parameters · 116 preference pairs · 3 epochs. The rest of the network — including all 64 routed experts in every layer — remains bit-identical to base. If the methodology had worked, this would have been one of the most parameter-efficient reasoning improvements ever published. It didn't, but the experimental design is the right shape.
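
A hedged sketch of the training stage (TRL's DPOTrainer is an assumption, as are all hyperparameters except the epoch and pair counts; the dataset and tokenizer names are placeholders):

```python
# Illustrative DPO setup; not the run's actual configuration.
from trl import DPOConfig, DPOTrainer

args = DPOConfig(
    output_dir="m3i-lora-dpo",          # hypothetical path
    num_train_epochs=3,                 # from the text
    beta=0.1,                           # assumed DPO temperature
    per_device_train_batch_size=2,      # assumed
)
trainer = DPOTrainer(
    model=model,                        # the PEFT-wrapped model from above
    args=args,
    train_dataset=preference_pairs,     # the 116 chosen/rejected pairs (placeholder name)
    processing_class=tokenizer,         # the model's tokenizer (placeholder name)
)
trainer.train()
```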

This nested selection is what makes M3I's hypothesis testable in the first place: by intervening on a precisely defined slice of the network and leaving everything else frozen, we can attribute any behavioral change cleanly to the LoRA on those specific layers. With full fine-tuning, the same eval result would tell us almost nothing about where in the network the change happened.

Per-problem traces

[Table columns: Problem · Base verdict · LoRA verdict · Pairwise · Length]

Honest caveats

  1. n=12 held-out is small. The 95% CI on the pairwise win-rate spans roughly "LoRA significantly worse" through "no effect at all"; statistical power is low. (A quick interval check follows this list.)
  2. Stripped adapter. ~20% of attention-adapter capacity (kv_b_proj) was dropped during GGUF conversion to work around an MLA incompatibility in convert_lora_to_gguf.py. The full PEFT adapter has more capacity than what was actually evaluated.
  3. Modest training. 116 preference pairs over 3 epochs. Small by DPO standards.
  4. Capability floor. DeepSeek-V2-Lite at 16B is below the threshold for these problems — both base and LoRA are 90%+ wrong. Treatment effects are measured against a noisy floor.
  5. No random-layer control yet. Without comparing attribution-targeted LoRA against same-config LoRA on randomly chosen layers, we cannot distinguish "the methodology adds no value" from "any DPO at this scale would have failed."
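
The interval check for caveat 1 (statsmodels and the exact method are assumptions; the report doesn't say how its CI was computed):

```python
# Clopper-Pearson 95% CI for 1 clean LoRA win out of 12 pairs.
from statsmodels.stats.proportion import proportion_confint

lo, hi = proportion_confint(count=1, nobs=12, alpha=0.05, method="beta")
print(f"LoRA clean-win rate 1/12 · 95% CI [{lo:.3f}, {hi:.3f}]")
# roughly [0.002, 0.385]: compatible with anything from "clearly worse
# than base" to "no real effect"
```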

Two experiments stand between this and a publishable null result

  • Random-layer control. Same hyperparams + 116 pairs, layers sampled from {0..26} ∖ {6,7,8,10,13}; re-run Phase 5. (A sampling sketch follows this list.)
  • Generation-seed sensitivity. Repeat Phase 5 three times to establish error bars around the 3-1-8 headline.
  • Cost. ~90 min MI300X compute · ~$2 API spend
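
The sampling sketch for the random-layer control (the seed is illustrative; only the layer sets come from above):

```python
# Draw 5 control layers from outside the attribution band.
import random

ATTRIBUTION = {6, 7, 8, 10, 13}
candidates = sorted(set(range(27)) - ATTRIBUTION)    # {0..26} minus the band
random.seed(0)                                       # any fixed seed, for reproducibility
control_layers = sorted(random.sample(candidates, k=len(ATTRIBUTION)))
print(control_layers)
# then rerun the section-01 config with layers_to_transform=control_layers
```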