How a 162-second retrain recovers SAM3's open-vocab refusal

Post 01 left SAM3 in a strange place. The watch-component IoU was up to 0.92, a clean win on the target task. But on SA-Co/Gold (the open-vocab promptable-segmentation benchmark SAM3 was originally evaluated on) the same checkpoint behaved like a completely different model. Two failure modes appear together: the model lost the ability to refuse (neg_correct on absent prompts dropped from 95.8 % to 3.2 %, cgF1 from 55.7 to 6.7), and its representations on broad open-vocab concepts drifted away from base, a more subtle erosion that costs a few IoU points on SA-Co positives and a couple of points on LVIS-val. The two failures share causes but respond to different fixes.

This post chases both fixes and finds a useful generalisation about when in the training pipeline a CF mitigation has to live to actually work. Three data-side recipes (replay, replay with negatives, post-hoc continuation training) recover partial slices and stop at sharply different ceilings. Each ceiling is geometrically explainable. The transferable lesson is in the replay-vs-recovery contrast: replay during the original fine-tune preserves general capability that post-hoc continuation training cannot recover — even when both regimes train on the same data. The post finishes with a one-shot retrain of a single 132 k-parameter MLP that takes 162 seconds on one GPU and recovers more refusal than every full-model retrain combined. Along the way: a methodological beat about what convergent mech-interp probes can and cannot tell you.

What actually broke — refusal collapse and OOD drift

SA-Co/Gold reports six numbers per checkpoint. Four of them (neg_correct, cgF1, IL_MCC, mean_mask_IoU) collapse under the watch FT; pmF1 and pos_mean_IoU move less. The collapse is driven by the negative side.

Table 1.1 — SA-Co/Gold, base SAM3 vs the frozen-TE watch fine-tune (7 subsets × 2 000 pairs, seed 0).
Metric	base SAM3	watch FT (frozen-TE)	Δ
neg_correct (refusal rate on absent concepts)	95.8 %	3.2 %	−92.6 pp
cgF1 (concept-grouped F1, the SA-Co headline)	55.7	6.7	−49.0
IL_MCC (instance-level matched-correlation coefficient)	0.828	0.171	−0.657
mean_mask_IoU (positive-and-negative averaged)	89.2	14.7	−74.5
pmF1 (positive-pair micro F1)	67.1	40.8	−26.3
pos_mean_IoU (mask quality on present concepts)	63.0	58.7	−4.3
LVIS mIoU (open-vocab positive segmentation, held-out)	56.4	53.9	−2.5

Reading the table top-down, two stories run side-by-side:

The catastrophic story is on the negative side. neg_correct collapses from 95.8 % to 3.2 %. The model has essentially lost the “say no” decision. Because cgF1 and IL_MCC both penalise wrong masks on absent prompts (that’s what they were designed to measure on an open-vocab benchmark), both collapse with it. mean_mask_IoU averages across positives and negatives; the −74.5 drop is almost entirely false-positive masks dragging the average down on negative pairs.
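As a back-of-envelope check on that last claim, the sketch below assumes (this is an assumption, not something the benchmark definition here states) that mean_mask_IoU is a plain average in which a negative pair scores 100 when the model refuses and 0 when it emits a mask. Under that reading, each checkpoint's three reported numbers imply a share of negative pairs, and the two checkpoints should roughly agree on it.

```python
def implied_negative_share(mean_iou, pos_mean_iou, neg_correct):
    """Solve mean = (1 - p) * pos_mean + p * neg_correct for p.

    Hypothetical decomposition: the negative-side average is taken to equal
    neg_correct. All inputs are percentages copied from Table 1.1.
    """
    return (mean_iou - pos_mean_iou) / (neg_correct - pos_mean_iou)


base_share = implied_negative_share(89.2, 63.0, 95.8)
ft_share = implied_negative_share(14.7, 58.7, 3.2)
print(f"base implies p = {base_share:.3f}, watch FT implies p = {ft_share:.3f}")
```

Both checkpoints imply a negative share close to 0.8, so under this assumed definition the average is dominated by negative pairs, which is consistent with the refusal collapse accounting for most of the drop.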

The subtler story is on the positive side. The model has not forgotten how to segment an object when it’s asked to segment one. pos_mean_IoU on SA-Co/Gold positives drops only 4.3 points (63.0 → 58.7) and LVIS mIoU drops 2.5 points (56.4 → 53.9). pmF1 looks larger (−26.3) but that’s an artefact of how pmF1 aggregates across the full SA-Co/Gold sweep, which mixes positives and negatives: when the model fires constantly, the positive F1 is dragged down by precision on absent prompts. On strictly positive pairs the mask quality is near base. Catastrophic forgetting on SAM3-watch is a presence-detection problem, not a mask-quality problem.

We’ll come back to the small positive-side drops in §4 — they look minor here, but the three recipes we test in §2 produce sharply different outcomes on them, and the difference turns out to be load-bearing for the transferable lesson.

The visceral version — read the presence-token sigmoid directly

The rawest view of the failure is to read the gating scalar straight off the inference path. SAM3’s image processor multiplies a single scalar — presence_logit_dec.sigmoid() — into every per-box dot-product score before the confidence threshold (0.3) is applied. That scalar is the gate that says “is this NP grounded in this image at all?”. On 12 SA-Co/Gold negative prompts probed directly — harem pants, the Ford Maverick, Tibia, perfume bottle, and so on, on photographs that plainly don’t contain any of those things — base SAM3 produces a sigmoid value of ≤ 0.08 (typically ≤ 0.01) on all 12. The watch FT produces σ ≈ 1.000 on 12 of 12. The gate is stuck open.

FIG. 1.1 · presence_logit_dec.sigmoid() · 12 SA-Co/Gold negative prompts

The gate is stuck open

Per-prompt presence-sigmoid on 12 SA-Co/Gold negative prompts — concepts that plainly don’t appear in the corresponding images. Amber bars: the watch FT, saturated at σ ≈ 1.000 across all 12. Parchment bars: base SAM3, σ ≤ 0.08 on every prompt. The dotted line marks the 0.3 confidence threshold the processor uses; the FT’s gate sits well above it on every negative pair.

Two questions follow. Can we get the gate back without another full fine-tune? And does the mask-quality drift on positive prompts respond to the same interventions as the refusal collapse? The next section sets up three recipes designed to answer both.

Three recipes for repair

The classical catastrophic-forgetting mitigation is replay (also called rehearsal): keep showing examples from the old distribution while training on the new task, so the model can’t drift too far. The technique goes back at least to Robins (1995) and is the backbone of most modern continual-learning recipes (Lopez-Paz & Ranzato 2017; Chaudhry et al. 2019). We tested it in two flavours — positive-only replay and replay with an explicit negatives stream — plus the most-tried folk move when prevention isn’t an option: post-hoc continuation training (take the broken checkpoint and keep training on the mixed distribution). The literature is much quieter on whether post-hoc continuation actually works as a CF remedy; this post is partly an empirical answer.

All three start from the same base SAM3, target the same watch task, and use the same frozen text encoder (per Post 01’s finding that the text encoder is unnecessary for the watch task). The difference is what else they expose the model to during training.

Recipe family A — replay during fine-tuning

A two-stream training loop: every N-th watch batch is replaced by an LVIS batch. The watch task drives the fine-tune; the LVIS batches act as a regulariser against drift on broadly-LVIS-like inputs. We tested two sparsities:

lvis_replay: 1 LVIS batch per 10 watch batches (~10 % LVIS).
lvis_replay_n100: 1 LVIS batch per 100 watch batches (~1 % LVIS), as a sparsity ablation.

Both run for 3 epochs at the canonical watch-FT batch size (16, on 2 × RTX 4090). The LVIS stream draws from a 30 k-image subset (deterministic, seed=0), and each LVIS image is shown at most once across the full run. LVIS data is positive-only: every (image, prompt) pair has a real GT mask.
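The replay cadence can be sketched as a small generator. This is a minimal illustration, not the training code: `interleave_replay` and the integer stand-ins for batches are hypothetical, and the sketch assumes the swap simply stops once the replay stream runs out.

```python
def interleave_replay(watch_batches, replay_batches, every_n=10):
    """Yield (source, batch); every `every_n`-th watch slot is replaced by a replay batch."""
    replay_iter = iter(replay_batches)
    for slot, watch_batch in enumerate(watch_batches, start=1):
        if slot % every_n == 0:
            replay_batch = next(replay_iter, None)
            if replay_batch is not None:
                yield "lvis", replay_batch
                continue  # the watch batch in this slot is skipped
        yield "watch", watch_batch


# Integers stand in for real batches.
schedule_n10 = [src for src, _ in interleave_replay(range(100), range(50), every_n=10)]
schedule_n100 = [src for src, _ in interleave_replay(range(100), range(50), every_n=100)]
print(schedule_n10.count("lvis") / len(schedule_n10))    # share of LVIS slots at N=10
print(schedule_n100.count("lvis") / len(schedule_n100))  # share of LVIS slots at N=100
```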

Recipe family B — replay with explicit negatives

One more stream: a third collator injects a (watch-patch, absent-watch-component) batch at the same N=10 cadence as the LVIS stream. Targets for this stream are empty mask sets — the loss penalises positive predictions on those pairs. The negatives corpus is 5 388 pairs constructed from quadrant-cropped watch patches: for each patch, the model is shown a watch component category (one of caseback, dial, hands, lumi, clasp, subdials, crown, print, indices, ..., 22 unique categories in all) that doesn’t actually appear in that patch. So the negative examples are not out-of-distribution distractors — they’re fine-grained false-positive cases right inside the watch task itself. Everything else matches lvis_replay: 3 epochs, batch size 16, frozen text encoder, 10 % LVIS replay. The only thing that differs is the negatives stream. This is lvis_replay_neg.
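A minimal sketch of how such a negatives corpus could be assembled. The patch ids and their label sets are made up for illustration, only nine of the 22 categories are listed, and drawing one absent category per patch is an assumption; the post does not say how many negatives each patch contributes.

```python
import random

# Nine of the watch-component categories named above (the full list has 22).
WATCH_CATEGORIES = ["caseback", "dial", "hands", "lumi", "clasp",
                    "subdials", "crown", "print", "indices"]


def build_negative_pairs(patch_labels, categories, seed=0):
    """Pair each patch with one category that does NOT appear in it.

    patch_labels maps a patch id to the set of categories present in the patch.
    The target for every pair is an empty mask set.
    """
    rng = random.Random(seed)
    pairs = []
    for patch_id in sorted(patch_labels):
        absent = [c for c in categories if c not in patch_labels[patch_id]]
        if absent:
            pairs.append({"patch": patch_id, "prompt": rng.choice(absent), "masks": []})
    return pairs


# Hypothetical patches and labels.
patch_labels = {
    "patch_000": {"dial", "hands", "indices"},
    "patch_001": {"caseback"},
    "patch_002": {"crown", "dial"},
}
negatives = build_negative_pairs(patch_labels, WATCH_CATEGORIES)
```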

Recipe family C — post-hoc recovery from the broken FT

The mirror experiment: take the already-converged frozen-TE FT (watch IoU 0.922, neg_correct 0.032 — the broken state itself, our actual subject) and keep training on a mostly-LVIS data mix. Primary stream is LVIS-train; secondary stream is watch (the same 10 % cadence, but reversed in role: LVIS dominant, watch as replay). The optimiser and scheduler are reset to step 0 so the model trains as if from a fresh run, even though the weights are already converged.

The LR schedule needs some care for continuation runs. A vanilla warmup on already-converged weights produces a large early dip in every metric — the high-LR warmup ramp shakes the FT loose before the LVIS gradients have settled. After some calibration we settled on a custom “match-then-boost” schedule: linear warmup over the first 50 steps, plateau through step 400, then ramp LR 5× and sustain through step 2000. This avoids the early dip and pushes the LVIS gradients harder once the model is in a stable region. We save per-step checkpoints at {50, 100, 200, 400, 800, 1200, 1600, 2000} so we can see the trajectory. This is recovery_frozen_te_boost.
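The shape of that schedule can be written as a single function of the step. This is a sketch under stated assumptions: `base_lr` and the length of the ramp (`ramp_steps`) are placeholders, since the post gives only the 50-step warmup, the plateau to step 400, and the 5× boost held to step 2000.

```python
def match_then_boost_lr(step, base_lr=1e-5, warmup_steps=50, plateau_end=400,
                        ramp_steps=100, boost=5.0):
    """Learning rate at `step`: linear warmup, plateau, then a ramp to boost * base_lr.

    base_lr and ramp_steps are assumed values, not taken from the post.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warmup
    if step < plateau_end:
        return base_lr                               # "match" plateau
    progress = min(1.0, (step - plateau_end) / ramp_steps)
    return base_lr * (1.0 + (boost - 1.0) * progress)  # "boost" ramp, then hold


for s in (0, 49, 200, 400, 450, 2000):
    print(s, match_then_boost_lr(s))
```

In a PyTorch training loop a function like this would typically be wrapped as a multiplicative factor for a lambda-based scheduler; that wiring is omitted here.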

Note: the recovery recipe has no negatives stream. That’s deliberate — we want to test whether post-hoc continuation training on LVIS-only positive data can recover refusal at all. The contrast with lvis_replay_neg isolates two questions: does the model need negative examples to relearn refusal? and does it just need more training on broad-distribution positive data?

“On one axis: does the loss include a refusal signal? Plain LVIS replay has only positive labels; replay-with-negatives explicitly penalises false positives on negative prompts. On the other axis: is this signal present during the watch fine-tune, or applied afterwards? Replay variants regularise during training; recovery variants act post-hoc.”— The 2×2 we’re really testing

Headline evaluation — every recipe, every metric

Same eval suite as Post 01: SA-Co/Gold (7 subsets × 2 000 pairs, seed 0) for OOD; the canonical 30 k-image LVIS-val subset for held-out positives; the 180-task watch-component test split for in-domain.

Table 3.1 — All recipes, in-domain (watch) + two OOD benchmarks (SA-Co/Gold and LVIS-val). Bold = best per column among recipes (excluding base). Recipes are grouped: replay (during FT) above the rule, recovery (post-hoc) below.
Recipe	Watch IoU	LVIS mIoU	LVIS pmF1	SA-Co pmF1	SA-Co cgF1	SA-Co IL_MCC	SA-Co pos_mean	SA-Co neg_correct
base SAM3	0.227	56.4	48.14	67.08	55.72	0.828	62.95	95.8 %
frozen-TE FT (no replay)	0.922	53.9	31.09	40.77	6.73	0.171	58.72	3.2 %
`lvis_replay` (N=10, positive-only)	0.919	56.9	50.42	63.47	11.00	0.175	62.28	2.9 %
`lvis_replay_n100` (N=100, sparsity ablation)	0.916	54.9	46.80	57.40	7.93	0.139	60.96	2.9 %
`lvis_replay_neg` (adds negatives stream)	0.913	58.9	51.32	64.05	22.85	0.353	62.56	41.3 %

recovery@2000 (post-hoc, no negatives)	0.901	56.3	49.94	61.35	14.69	0.239	55.20	10.6 %

Reading the table:

The negatives stream is the only intervention that meaningfully recovers refusal. lvis_replay_neg lifts neg_correct from 2.9 % to 41.3 %. Plain LVIS replay at any density (N=10 or N=100) leaves neg_correct at ~3 %; post-hoc LVIS recovery only reaches 10.6 %. Adding more positive data doesn’t teach refusal at any density or timing. cgF1 tracks neg_correct closely: 6.7 (FT) → 11.0 (lvis_replay) → 14.7 (recovery) → 22.9 (lvis_replay_neg).
The positive-side metrics tell a different story. SA-Co pos_mean: base 63.0 → FT 58.7 → lvis_replay 62.3 → recovery 55.2. The two recipes that train on LVIS positives don’t agree on positive-pair mask quality: replay preserves it (within 0.7 of base), recovery degrades it further (3.5 points below the already-FT-degraded starting point). Same pattern on LVIS mIoU: replay finishes at 56.9 (above base 56.4), recovery at 56.3 (worse than replay).
The two failures dissociate. lvis_replay_neg fixes both refusal and preserves OOD positive quality; lvis_replay preserves positive quality but doesn’t fix refusal; recovery half-fixes refusal but degrades positive quality. Whatever knob each metric responds to, the knobs aren’t the same.
Replay doesn’t trade off in-domain performance for OOD preservation. Watch IoU: FT 0.922, lvis_replay 0.919, lvis_replay_neg 0.913. The replay variants land within ~0.01 of the no-replay FT (0.003 and 0.009 below it, respectively), inside run-to-run noise. Recovery, by contrast, drops watch IoU to 0.901 (~2 IoU points). So replay isn’t a compromise where we accept some in-domain cost to buy back OOD quality; it’s a Pareto move. We aren’t making the model worse at the watch task; we’re just refusing to let it forget how to do anything else. Post-hoc recovery is the only recipe that pays in-domain points for partial OOD recovery.
Cost framing. All three data-side recipes are full 3-epoch fine-tunes (replay variants 2 GPUs ~24 h, recovery 1 GPU ~6 h). The relevant cost comparison isn’t recipe-vs-recipe wall-clock though — it’s “what would I have to add to a future fine-tune to avoid this failure?”. The answer for lvis_replay_neg is ~10 % more training samples vs the naïve recipe (one negatives batch per ten watch batches). That’s a marginal cost, not a 2× cost — for any future task where CF on a broader benchmark is a concern, the answer is just “include a negatives stream”.

Two observations here are surprising and the rest of the post unpacks them with mechanistic data:

Replay during training preserves SA-Co pos_mean (62.3 vs base 63.0); post-hoc recovery on essentially the same data degrades it further (55.2). Why does timing matter so much? (§4 sets up the question; §6–§8 answer it mechanistically.)
None of the data-side recipes fully closes the neg_correct gap. §5 introduces a fundamentally different attack, a 162-second retrain of a single 132 k-param MLP, that closes most of it. §9 explains the methodological detour through standard mech-interp that almost convinced us the fix lived somewhere else entirely.

Replay preserves, recovery degrades — the timing matters

Compare the two LVIS-positive recipes side by side:

Table 4.1 — The same LVIS positive data, in two different timing regimes.
	`lvis_replay`	recovery@2000	base SAM3 (ref)
Where in training does LVIS data enter?	during the watch FT	after the watch FT	—
Total LVIS exposure	~3 000 batches	~1 800 batches	—
Watch IoU	0.919	0.901	0.227
LVIS mIoU	56.9	56.3	56.4
SA-Co pos_mean_IoU	62.3	55.2	63.0

Two recipes that nominally do the same thing — “train on watch + 10 % LVIS” — and a 7-point SA-Co positive-IoU gap between them. lvis_replay preserves base-level positive quality on SA-Co (62.3 vs base 63.0). Recovery degrades it further (55.2, well below the already-FT-degraded starting point of 58.7).

Watch IoU follows the same direction at smaller magnitude: replay 0.919, recovery 0.901, both starting from the same FT’s 0.922. Replay preserves the watch peak almost perfectly. Recovery erodes it ~2 IoU points.

And LVIS mIoU is the metric where recovery looks fine: 56.3 ≈ 56.9 ≈ 56.4 (base). The recovery training is on LVIS-train data, so by every standard generalisation argument, LVIS-val should be where it most clearly recovers. It does. The puzzle is that this specific metric is the only OOD positive-quality metric where post-hoc recovery succeeds.

“Post-hoc continuation training on data X recovers metrics that lie in (or near) X. Metrics that lie outside X can degrade further during recovery training, not less. Replay during the original fine-tune is qualitatively different: it prevents the broken trajectory from being entered in the first place. The difference is mechanistic, and §6-§8 unpack why.”— The transferable lesson

Before going deeper on what these recipes do internally, there’s a fourth recipe worth introducing — a different attack on the same problem. The three recipes above are all variants of “train the entire model with a different data mix”. They share the assumption that the fix has to live somewhere distributed across the network. The fourth recipe drops that assumption: identify the specific broken module and retrain only it. The mechanistic deep dive in §6-§8 will then explain all four recipes together.

A 132 k-parameter, 162-second fix

SAM3’s image processor implements presence-gating like this:

# sam3/model/sam3_image_processor.py
presence_score = outputs["presence_logit_dec"].sigmoid().unsqueeze(1)
out_probs      = (out_logits.sigmoid() * presence_score).squeeze(-1)
keep           = out_probs > self.confidence_threshold  # 0.3 in eval

A single scalar per image — presence_logit_dec — multiplies the per-box dot-product scores; the threshold is on the product. The scalar comes from a small MLP at the output of decoder[5]:

# sam3/model/decoder.py
intermediate_layer_presence_logits = self.presence_token_head(
    self.presence_token_out_norm(presence_out)
)

where presence_out is the q0 slot of decoder[5]’s output. The head is three linears (256→256→256→1) with a LayerNorm on the input and a learnable token (the q0 initial state). 132 609 parameters across 9 tensors. That’s the entire gating circuit between the decoder’s last layer and the score multiplication at inference time.
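The parameter count can be reproduced from the shapes alone. The tensor names below are hypothetical labels, not the checkpoint's actual keys; the shapes follow the description above (three linears, one LayerNorm, one 256-d learnable token).

```python
from math import prod

# Hypothetical names; shapes follow the 256→256→256→1 head described above.
presence_head_shapes = {
    "presence_token": (1, 256),
    "out_norm.weight": (256,),
    "out_norm.bias": (256,),
    "head.0.weight": (256, 256),
    "head.0.bias": (256,),
    "head.1.weight": (256, 256),
    "head.1.bias": (256,),
    "head.2.weight": (1, 256),
    "head.2.bias": (1,),
}

num_tensors = len(presence_head_shapes)
num_params = sum(prod(shape) for shape in presence_head_shapes.values())
print(num_tensors, num_params)
```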

We saw in §1 what this gate looks like after the watch FT: 12 / 12 saturated at σ ≈ 1.000 on SA-Co negative prompts (base produces σ ≤ 0.08 on the same prompts). The gate has become a constant-near-+∞ function. If the fix is “teach the gate to not fire on absent concepts”, the most targeted intervention is to retrain just these 132 k parameters.
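A small worked example of what a saturated gate does at the threshold. The per-box scores are made-up values; the gating rule (per-box sigmoid times presence sigmoid, compared against 0.3) mirrors the processor excerpt above.

```python
def gate_boxes(box_probs, presence_prob, threshold=0.3):
    """Keep a box only if (box score * presence score) clears the threshold."""
    return [p * presence_prob > threshold for p in box_probs]


box_probs = [0.9, 0.5, 0.2]                 # hypothetical per-box sigmoid scores

stuck_open = gate_boxes(box_probs, 1.000)   # watch FT on a negative prompt
base_like = gate_boxes(box_probs, 0.08)     # base SAM3's upper bound on the same prompts
print(stuck_open, base_like)
```

With the gate at 1.0 the threshold only sees the raw box scores, so any confident box survives; at 0.08 even a 0.9 box is scaled to 0.072 and dropped.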

Retrain just the gate

The intervention writes itself: fork from lvis_replay (so positive mask quality on watch is preserved upstream), freeze everything except the 9 tensors of the presence head, train BCE on presence_logit_dec with a balanced positive / negative set. Positives: 2 500 (watch image, watch component label) pairs from the watch training set. Negatives: 2 500 (watch patch, absent-watch-component) pairs from the same 5 388-pair corpus the lvis_replay_neg stream used. Adam, lr=1e-4, 1 000 steps. Walltime 162 seconds on one RTX 4090.
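The principle (frozen upstream, trainable gate, balanced BCE) can be shown on a toy problem in plain Python. This is an illustration only: the 1-D features stand in for the frozen decoder readout, the gate is a single logistic unit rather than the real MLP, and plain gradient descent replaces Adam.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Frozen "upstream": fixed toy features, never updated during training.
positives = [0.8, 1.0, 1.2]      # label 1 (concept present)
negatives = [-1.2, -1.0, -0.8]   # label 0 (concept absent)
data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]

w, b = 0.0, 8.0                  # saturated start: sigmoid(8) is ~1 for every input
start_sigma = sigmoid(b)

lr = 0.5
for _ in range(1000):
    grad_w = grad_b = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y      # derivative of BCE w.r.t. the logit
        grad_w += err * x / len(data)
        grad_b += err / len(data)
    w -= lr * grad_w                      # only the gate's parameters move
    b -= lr * grad_b

pos_sigma = [sigmoid(w * x + b) for x in positives]
neg_sigma = [sigmoid(w * x + b) for x in negatives]
print(round(start_sigma, 4), [round(s, 3) for s in pos_sigma], [round(s, 3) for s in neg_sigma])
```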

The loss trajectory on a few selected steps:

Table 5.1 — `presence_only` training trajectory. pres_σ is the sigmoid of the presence logit on the prompt; the watch FT starts at saturated σ = 1.0 on both positive and negative pairs.
step	pos pres_σ	neg pres_σ	pos BCE	neg BCE
0	1.000	—	0.00	—
75	0.967	0.802	0.04	3.77
150	0.739	0.208	0.37	0.31
250	0.818	0.207	0.24	0.33
750	0.922	0.103	0.09	0.14
999	0.893	0.165	0.13	0.37

Negative pres_σ drops from the saturated 1.000 to ~0.21 by step 150 and ends at ~0.17; positive pres_σ dips to 0.74 at step 150 and settles at ~0.89 (the watch task wants it high). The gate learns to discriminate well inside the 162-second run.

What this buys at evaluation

Table 5.2 — `presence_only` endpoint on the full SA-Co/Gold + watch suite, with the three data-side recipes as comparison.
	Watch IoU	SA-Co cgF1	SA-Co IL_MCC	SA-Co mean_mask_IoU	SA-Co pos_mean	SA-Co neg_correct
base SAM3	0.227	55.7	0.828	89.2	63.0	95.8 %
`lvis_replay`	0.919	11.0	0.175	15.0	62.3	2.9 %
recovery@2000	0.901	14.7	0.239	20.2	55.2	10.6 %
`lvis_replay_neg` (full FT + negatives)	0.913	22.9	0.353	46.9	62.6	41.3 %
`presence_only` (162 s on top of lvis_replay)	0.879	40.4	0.650	82.8	66.1	87.85 %

neg_correct: 2.9 % → 87.85 %, recovering almost all the way to base’s 95.8 %. mean_mask_IoU 15.0 → 82.8 closes about 91 % of the gap to base. cgF1 11.0 → 40.4. SA-Co pos_mean actually improves past base (63.0 → 66.1) because the watch-FT’s mask-quality gain on positive prompts is preserved (upstream is byte-identical to lvis_replay) and the now-correctly-calibrated gate stops stamping every prediction with σ = 1.
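The "share of the gap closed" figures can be recomputed from Table 5.2. The helper name is ours; the numbers are the lvis_replay, presence_only and base rows of the table.

```python
def gap_closed(before, after, base):
    """Fraction of the (base - before) gap recovered by moving from before to after."""
    return (after - before) / (base - before)


# (lvis_replay, presence_only, base SAM3) from Table 5.2
table_5_2 = {
    "neg_correct": (2.9, 87.85, 95.8),
    "mean_mask_IoU": (15.0, 82.8, 89.2),
    "cgF1": (11.0, 40.4, 55.7),
    "IL_MCC": (0.175, 0.650, 0.828),
}
closed = {name: gap_closed(*row) for name, row in table_5_2.items()}
for name, frac in closed.items():
    print(f"{name}: {frac:.1%}")
```

The two metrics tied most directly to refusal recover the largest share; cgF1 and IL_MCC recover less of their gap.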

The watch task pays 4 IoU points (0.919 → 0.879). The presence head also gates watch predictions, and after retraining it outputs pos pres_σ ≈ 0.89 instead of the saturated 1.000 it inherited from lvis_replay. The masks themselves haven’t changed (the mask predictor is frozen); the gate just stops endorsing every prediction at σ = 1, so some borderline-positive masks fall below the 0.3 confidence threshold. This is a calibration trade-off, not a mask-quality loss; a class-weighted BCE (pos_weight > 1) would likely claw most of it back.
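For reference, a class-weighted BCE of the kind suggested here scales only the positive term. The sketch below is a plain-Python version of that loss; the value pos_weight = 3.0 is an arbitrary illustration, not a tuned or tested setting.

```python
import math


def weighted_bce(prob, label, pos_weight=1.0):
    """Binary cross-entropy where the positive-label term is scaled by pos_weight."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(pos_weight * label * math.log(prob)
             + (1.0 - label) * math.log(1.0 - prob))


plain_pos = weighted_bce(0.89, 1.0)                     # positive pair at pres_sigma = 0.89
boosted_pos = weighted_bce(0.89, 1.0, pos_weight=3.0)   # same pair, positives up-weighted
plain_neg = weighted_bce(0.17, 0.0)
boosted_neg = weighted_bce(0.17, 0.0, pos_weight=3.0)   # negative term is unaffected
print(plain_pos, boosted_pos, plain_neg, boosted_neg)
```

Up-weighting positives raises the cost of leaving a present concept at σ ≈ 0.89, which pushes the gate back toward high confidence on positives without changing the penalty on negatives.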

Why it generalises OOD — an architectural argument

The most surprising part of this result is what it was trained on vs what it was tested on:

Table 5.3 — Training vs evaluation distribution for `presence_only`.
	Training	Evaluation (SA-Co/Gold)
Image distribution	Watch close-ups + watch patches	Arbitrary photos (SA-1B, MetaCLIP, web)
Negative-NP source	22 watch-component categories absent from the specific patch	Open-domain human-written negative NPs
Positive-NP source	37 watch component categories	1 000+ open-vocab nouns (no overlap with watch parts)
Vocabulary overlap (train ∩ eval)	~ 0 (essentially empty intersection)
Scene overlap (train ∩ eval)	~ 0 (no test images are wristwatches)

Despite essentially zero vocabulary and scene overlap, the retrained presence head transfers near-perfectly. It doesn’t work because the model “knows what watch parts look like on SA-Co images” — most SA-Co images contain no watch at all. It works because the architecture positions the presence head as a domain-agnostic binary classifier of “this NP grounds in this image” vs “it doesn’t”. The gate reads only the q0 readout of decoder[5] — a per-(image, NP) scalar with no class-specific structure. Its role is binary grounding, not category recognition. The watch FT specifically miscalibrated it for the “absent” decision because the watch training data has no negative prompts. 162 s of in-domain BCE re-calibrated it back into its designed role.

“A 132 k-parameter MLP can be the single point of failure of an 840 M-parameter open-vocab model. Whenever the model’s downstream behaviour is gated by a small, named module, the cheapest intervention is to retrain that module directly, provided you’ve identified it. The next four sections are about why this works geometrically and why the data-side recipes can only get partway. They also unpack the methodological detour that finally pointed us at the presence head as the actual gate.”— The fourth recipe’s argument

Activation evidence — recipes produce structurally different activation states

A simple probe: for each checkpoint, push 97 prompts through 20 fixed watch images and capture three activation slots — fusion[4] (where Post 01 localised the watch task), decoder[5].q0 (the input to the presence-gating MLP), and decoder[5] mean (a deeper-readout aggregate). Per-prompt activations are then compared to base SAM3’s on the same prompts. The same 20 watch images are used as the visual substrate across all checkpoints, so the only thing varying is the prompt. The 97 prompts split into three groups:

37 watch components (the in-domain reference)
30 hand-curated LVIS-style category names (cat, dog, car, bicycle, table, pizza, ...)
30 SA-Co/Gold positive NPs (free-form open-vocab nouns from the actual benchmark)

Metric: per-concept relative drift rel_Δ = ‖activation_FT − activation_base‖ / ‖activation_base‖, computed separately at each of the three slots above. The lower the rel_Δ, the closer the fine-tune’s activations sit to base’s on those prompts.
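The drift metric itself is a one-liner. A minimal sketch on flat Python lists, with toy vectors in place of real activations:

```python
import math


def rel_delta(act_ft, act_base):
    """Relative drift: ||act_ft - act_base|| / ||act_base|| (Euclidean norms)."""
    diff = math.sqrt(sum((f - b) ** 2 for f, b in zip(act_ft, act_base)))
    base = math.sqrt(sum(b * b for b in act_base))
    return diff / base


unchanged = rel_delta([3.0, 4.0], [3.0, 4.0])   # identical activations
shifted = rel_delta([0.0, 4.0], [3.0, 4.0])     # moved by 3 units against a base norm of 5
doubled = rel_delta([6.0, 8.0], [3.0, 4.0])     # drift as large as the base activation itself
print(unchanged, shifted, doubled)
```

A value near 0.7–0.8, as in the table that follows, therefore means the activation has moved by most of its own base norm.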

Table 5.1 — rel_Δ vs base on 97 prompts (37 in-domain watch, 60 OOD) across three activation layers. Lower = activations closer to base. Bold = lowest across recipes (excluding base).
Recipe	f[4] watch	f[4] LVIS	f[4] SA-Co	d[5].q0 watch	d[5].q0 LVIS	d[5].q0 SA-Co	d[5] mean watch	d[5] mean LVIS	d[5] mean SA-Co
`frozen_te`	0.813	0.817	0.799	0.713	0.733	0.730	0.677	0.720	0.716
`lvis_replay`	0.731	0.755	0.755	0.714	0.744	0.755	0.686	0.730	0.738
`lvis_replay_n100`	0.843	0.837	0.828	0.722	0.734	0.731	0.677	0.709	0.704
`lvis_replay_neg`	0.731	0.677	0.679	0.642	0.604	0.595	0.614	0.576	0.567
recovery @ step 50	0.808	0.816	0.799	0.708	0.736	0.736	0.671	0.720	0.715
recovery @ step 400	0.779	0.809	0.795	0.687	0.753	0.748	0.658	0.730	0.724
recovery @ step 800	0.783	0.804	0.787	0.644	0.680	0.682	0.601	0.663	0.662
recovery @ step 1200 (cgF1 peak)	0.774	0.798	0.783	0.689	0.729	0.723	0.640	0.692	0.686
recovery @ step 2000	0.859	0.903	0.856	0.651	0.655	0.633	0.612	0.626	0.601
`presence_only`	0.731	0.755	0.755	0.714	0.744	0.754	0.686	0.730	0.738

Three readings of the table.

1. At fusion[4], replay variants pull drift down, recovery pushes it up past where the FT had left it. On LVIS prompts: frozen-TE 0.817 → lvis_replay 0.755 (cleaner) → recovery@2000 0.903 (worse than the FT). On SA-Co/Gold prompts: 0.799 → 0.755 → 0.856. Same pattern on watch prompts: 0.813 → 0.731 → 0.859. Recovery training isn’t walking the model back toward base. It’s walking the model further from base in a different direction.

The full step trajectory makes the dynamics visible. At step 50 fusion[4] is essentially still at frozen-TE (0.816 LVIS vs frozen-TE 0.817 — the model has barely moved). Through steps 400, 800, 1200 the LVIS rel_Δ steadily decreases (0.809 → 0.804 → 0.798) — the model is being pulled toward LVIS-base-like geometry. Watch-IoU at this point is still 0.91-ish; SA-Co cgF1 peaks at 11.76 at step 1200. Then something flips: from step 1200 to step 2000, fusion[4] rel_Δ climbs from 0.798 to 0.903 — well past frozen-TE’s 0.817 — and SA-Co metrics start to oscillate downward. The model didn’t walk back to base; it walked past frozen-TE into a new LVIS-specialised state.

The non-monotone shape is itself the lesson. There’s a moment in recovery training when the model has the cleanest geometry vs base it’ll ever have on this run, and it sits at step 1200, roughly 60 % through training. Past that, the LVIS loss dominates and the model walks further from base, not closer. Without per-step evaluation we’d have stopped at 2000 and reported worse-than-frozen-TE numbers without realising the trajectory had an earlier peak.

rel_Δ at fusion[4] across recovery training steps, with replay variants as horizontal reference lines — Per-step `rel_Δ` of fusion[4] activations vs base SAM3, on 97 prompts × 20 watch images. Three recovery curves (LVIS / SA-Co / watch prompt groups) all share the same shape: gentle improvement through the cgF1 peak at step 1200, then a sharp climb past frozen-TE’s starting drift by step 2000. Dotted reference lines: `frozen_te` (start), `lvis_replay`, and `lvis_replay_neg` endpoints — the during-FT recipes sit at uniformly lower drift than any recovery point.

2. At decoder[5].q0 the recovery actually cleans up. Same row, deeper layer: frozen-TE 0.733 → recovery@2000 0.655 on LVIS prompts. So while fusion[4] is being LVIS-specialised further (drifting away from base), the decoder is being pulled back toward base. The two layers move in opposite directions during recovery training. The combined picture: a model that’s drifted further on its mid-stack representation but partially recovered its downstream readout.

3. lvis_replay_neg is the only recipe that cleans every layer × source. It cuts rel_Δ below frozen-TE on every cell in the table: by ~0.06–0.08 on the watch prompts and ~0.12–0.15 on the LVIS and SA-Co prompts. The negatives stream is doing two things simultaneously: the upstream representations stay closer to base (consequence of per-concept differentiation pressure), and the downstream readout does too. presence_only (byte-identical to lvis_replay upstream by construction; only the 132 k presence head moved) appears in the table for control purposes.

Scatter: rel_Δ at decoder[5].q0 vs SA-Co neg_correct across all recipes; presence_only sits well above the data-side trend line — x-axis: `rel_Δ` of decoder[5].q0 activations (LVIS prompts) vs base — the input feeding the presence-head MLP. y-axis: SA-Co `neg_correct`. The data-side recipes (replay, replay_n100, recovery, replay_neg) trace a clean trend: the cleaner the gate’s input geometry, the more refusal capacity returns. `presence_only` breaks the trend entirely — its upstream is byte-identical to `lvis_replay` (same rel_Δ 0.744), but the gate has been retrained directly, so it recovers 88 % refusal on a still-drifted input. The gate-vs-input dichotomy the post turns on.

So at the activation level: replay during FT regularises the trajectory to stay close to base on OOD prompts; post-hoc recovery starts from a broken state and walks the fusion[4] representation further from base (even though it’s training on LVIS data, which ought to pull toward base), while only the decoder readout moves partway back. The next two sections look at which weights moved and in what direction to explain the asymmetry.

Weight movement — same circuit, different magnitudes

The activation table tells us what each recipe does to the forward pass on OOD prompts. The next question: which parameters actually moved? For each recipe we compute the per-parameter Frobenius norm of the weight delta vs its starting checkpoint, normalised by the starting weight’s norm (the “rel_change” column in standard weight-diff reports). The replay/replay-neg recipes start from base SAM3; the recovery recipe starts from frozen-TE. Recovery’s rel_change values are therefore relative to a different baseline — useful for “what did the recovery training change?”, less directly comparable to the others.
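A sketch of the per-group aggregation. Two things here are assumptions: that the "weighted mean" weights each parameter by its element count, and that the module group can be read off the first component of the parameter name. The parameter names and values are toy data.

```python
import math
from collections import defaultdict


def frob(values):
    return math.sqrt(sum(v * v for v in values))


def grouped_rel_change(start, end, group_of):
    """Element-count-weighted mean of ||end - start|| / ||start|| per module group.

    start / end map parameter names to flattened weight lists.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for name, w0 in start.items():
        delta = [b - a for a, b in zip(w0, end[name])]
        rel = frob(delta) / frob(w0)
        totals[group_of(name)][0] += rel * len(w0)
        totals[group_of(name)][1] += len(w0)
    return {group: s / n for group, (s, n) in totals.items()}


# Toy checkpoints with hypothetical parameter names.
start = {"fusion_encoder.a": [3.0, 4.0],
         "fusion_encoder.b": [1.0, 0.0, 0.0, 0.0],
         "text_encoder.w": [1.0, 1.0]}
end = {"fusion_encoder.a": [3.0, 4.5],
       "fusion_encoder.b": [1.0, 0.2, 0.0, 0.0],
       "text_encoder.w": [1.0, 1.0]}
rel_change = grouped_rel_change(start, end, group_of=lambda n: n.split(".")[0])
print(rel_change)
```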

Table 6.1 — Weighted mean rel_change per module group, each recipe vs its starting checkpoint. Bold = largest movement per group across the four full-fine-tune columns.
Module group	frozen_te − base	lvis_replay − base	lvis_replay_neg − base	presence_only − base	recovery − frozen_te
`dot_prod_scoring`	0.166	0.180	0.206	0.180	0.123
`decoder_ca_text`	0.153	0.165	0.175	0.165	0.114
`geometry_encoder`	0.146	0.152	0.168	0.152	0.116
`fusion_encoder`	0.129	0.136	0.142	0.136	0.100
`decoder_self_attn`	0.133	0.135	0.144	0.135	0.095
`transformer_decoder`	0.114	0.123	0.131	0.123	0.089
`vision_trunk`	0.015	0.016	0.016	0.016	0.010
`text_encoder` (frozen)	0.000	0.000	0.000	0.000	0.000

The structural reading is that every recipe moves the same circuit, just by different amounts. Across the four full-fine-tune columns, the rank order of module groups is nearly preserved: dot_prod_scoring moves most, then the decoder cross-text attention, then the geometry encoder, then the fusion encoder and the decoder self-attention (these two sit within 0.005 of each other and swap places between columns), then the rest of the transformer decoder. The vision trunk barely moves (1.0–1.6 %) and the text encoder is frozen by recipe.

lvis_replay_neg pushes every cross-modal group further than lvis_replay does — dot_prod_scoring goes from 0.180 to 0.206, decoder cross-text from 0.165 to 0.175. The negatives stream adds extra gradient signal in the same parameter subspace; it’s not opening up new weights.

lvis_replay and presence_only are indistinguishable here at the module-group level (presence_only keeps lvis_replay’s weights upstream and changes only the 132 k presence head, a tiny slice of transformer_decoder, so the group mean barely shifts).

Recovery numbers are roughly 20–35 % smaller than the frozen_te column across the board, consistent with measuring delta from frozen-TE (already fine-tuned) rather than base. The pattern is the same: same circuit, scaled down.

Direction matters — recovery moves the right parameter the wrong way

Pick the most-moved parameter and look at the direction each recipe pushed it. From Post 01 we know transformer.decoder.presence_token_out_norm.bias is the highest-ranked mover in every watch FT (rel_change 0.88–1.77× across the from-base recipes tabulated here). It’s 256 dimensions, so it has a well-defined direction in vector space, and we can compute cosines between recipes’ delta vectors.

Table 7.1 — Pairwise cosine of weight-delta directions on `presence_token_out_norm.bias` (256-d). “‖Δ‖ rel base” = ‖Δ‖ / ‖base’s value‖. A cosine near +1 means two recipes move the bias in the same direction; a negative cosine means they move it in opposing directions.
Trajectory	‖Δ‖ rel base	cos to `frozen_te − base`	cos to `presence_only − base`
`frozen_te − base` (the FT’s break)	0.88	+1.00	−0.40
`lvis_replay − base`	0.93	+0.96	−0.43
`lvis_replay_neg − base` (partial fix)	1.00	+0.47	−0.32
`presence_only − base` (full fix)	1.77	−0.40	+1.00
`recovery@2000 − frozen_te`	0.50	+0.76	−0.67

The cosines tell the story numerically; the picture is easier to read geometrically. Project the 256-d bias deltas to a 2D plane defined by two reference directions: the “FT-break” axis (the unit vector along frozen-TE minus base) and the orthogonal component toward presence_only − base. Every recipe’s bias movement is a 2D arrow in this plane.
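The projection described here is a two-step Gram-Schmidt construction. A minimal sketch, with 3-d toy vectors standing in for the 256-d bias deltas and hypothetical function names:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def project_to_break_plane(delta, ft_break, fix_delta):
    """2D coordinates of `delta` in the plane spanned by the FT-break axis and
    the component of `fix_delta` orthogonal to it."""
    e1 = unit(ft_break)                                   # FT-break axis
    along = dot(fix_delta, e1)
    e2 = unit([f - along * a for f, a in zip(fix_delta, e1)])  # orthogonal axis
    return dot(delta, e1), dot(delta, e2)


# Toy vectors: the break points along +x, the fix points back along -x and up.
ft_break = [2.0, 0.0, 0.0]
fix_delta = [-1.0, 1.0, 0.0]
print(project_to_break_plane(ft_break, ft_break, fix_delta))
print(project_to_break_plane(fix_delta, ft_break, fix_delta))
print(project_to_break_plane([3.0, 4.0, 5.0], ft_break, fix_delta))  # out-of-plane part is dropped
```

Any component of a delta outside this plane is discarded by the projection, so arrow lengths in the 2D picture can be shorter than the norms in the table.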

2D projection of presence-bias delta vectors: frozen_te / lvis_replay / recovery all along the FT-break axis; lvis_replay_neg at a partial angle; presence_only flipped to the opposite side. — Arrows from the origin (base SAM3) show each recipe’s movement of `transformer.decoder.presence_token_out_norm.bias` in the FT-break / orthogonal plane. The frozen-TE FT moves the bias along +x; `lvis_replay` lands almost on top of it. `lvis_replay_neg` is at about half the +x magnitude with a small orthogonal tilt — partial reversal. `presence_only` flips sign on +x entirely and adds a large orthogonal component, the only recipe whose direction is *opposite* to the break. Recovery@2000 (drawn from the frozen-TE tip) continues along +x — geometrically a continuation of the FT’s trajectory, not an undo of it.
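The plane coordinates can be approximated from Table 7.1 alone, without the raw 256-d vectors. The sketch below assumes the second axis is the Gram-Schmidt residual of the presence_only direction after removing its component along the FT-break direction; `plane_coords` is a hypothetical helper, and the results are only as precise as the two-decimal table entries.

```python
import math

# (||delta|| rel base, cos to frozen_te - base, cos to presence_only - base),
# copied from Table 7.1.
TABLE = {
    "frozen_te - base": (0.88, 1.00, -0.40),
    "lvis_replay - base": (0.93, 0.96, -0.43),
    "lvis_replay_neg - base": (1.00, 0.47, -0.32),
    "presence_only - base": (1.77, -0.40, 1.00),
    "recovery@2000 - frozen_te": (0.50, 0.76, -0.67),
}

def plane_coords(norm, cos_break, cos_presence, cos_axes):
    """Coordinates of a delta in the plane spanned by
    e1 = unit FT-break direction and
    e2 = unit part of the presence_only direction orthogonal to e1."""
    x = norm * cos_break
    y = norm * (cos_presence - cos_axes * cos_break) / math.sqrt(1.0 - cos_axes ** 2)
    return x, y

# Cosine between the two reference directions themselves.
cos_axes = TABLE["frozen_te - base"][2]

coords = {
    name: plane_coords(n, cb, cp, cos_axes)
    for name, (n, cb, cp) in TABLE.items()
}
for name, (x, y) in coords.items():
    # Fraction of the delta's length that lies inside the plotted plane.
    in_plane = math.hypot(x, y) / TABLE[name][0]
    print(f"{name:28s} x={x:+.2f} y={y:+.2f} in-plane fraction={in_plane:.2f}")
```

These are the coordinates of each delta as a free vector; in the figure the recovery arrow is drawn starting from the frozen-TE tip rather than from the origin. The in-plane fraction is a useful reminder that a 2D picture discards whatever part of a delta lies outside the two reference directions.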

Three readings of this table.

1. The FT’s break and lvis_replay’s movement are essentially the same direction (cos = +0.96). Replay does not reduce the geometric severity of the break on this parameter: the norm goes from 0.88 to 0.93, so the bias ends up slightly further from base, and the direction is preserved. This is why lvis_replay can’t recover neg_correct: it’s still walking along the FT’s broken axis.

2. The recipes that fix refusal flip the direction. presence_only has cosine −0.40 with the FT’s break direction: it’s undoing the break and overshooting in the opposite direction (norm 1.77, the largest movement on this bias and roughly 1.8× the next largest, lvis_replay_neg’s 1.00). lvis_replay_neg is halfway there: cosine +0.47 with the FT’s break direction (still partly aligned), meaning the negatives stream pulled the bias partly back along the broken axis but the trajectory hasn’t flipped sign yet. This matches its partial recovery on neg_correct (41.3 % vs presence_only’s 88 %).

3. The recovery training is geometrically a continuation of the FT’s break: its delta has cosine +0.76 with frozen_te − base. Even though recovery is training on LVIS positive data (the same data that lvis_replay uses to regularise the FT), it’s walking the bias further along the same broken axis. The norm 0.50 means the recovery moved the bias by half of base’s norm, a bit more than half of frozen-TE’s 0.88 deflection, and mostly in the FT direction; adding the two deltas as vectors puts the cumulative offset at about 1.3× base norm, well past where the FT left it.
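The ~1.3× figure in the third reading follows from adding the two deltas as vectors. A minimal check using only the Table 7.1 entries, assuming both norms are expressed relative to the same base norm (as the table caption states); `combined_norm` is a hypothetical helper name.

```python
import math

def combined_norm(norm_a, norm_b, cos_ab):
    """Length of a + b given the two lengths and the cosine between them."""
    return math.sqrt(norm_a ** 2 + norm_b ** 2 + 2.0 * norm_a * norm_b * cos_ab)

ft_norm = 0.88        # ||frozen_te - base|| / ||base||
recovery_norm = 0.50  # ||recovery@2000 - frozen_te|| / ||base||
cos_rec_ft = 0.76     # cosine between the two deltas

# Offset of the recovery checkpoint from base, in units of base norm.
total = combined_norm(ft_norm, recovery_norm, cos_rec_ft)
# Component of that offset along the FT-break axis alone.
along_break = ft_norm + recovery_norm * cos_rec_ft

print(f"total offset from base: {total:.2f} x base norm")   # 1.30
print(f"along the FT-break axis: {along_break:.2f} x base norm")  # 1.26
```

Almost all of the combined offset lies along the break axis, which is the numerical version of "a continuation of the FT's trajectory".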

“Recovery training without negatives isn’t failing to fix the gate because it’s missing the right parameter. It’s moving the right parameter in the wrong direction: it’s the FT’s trajectory, slower. Replay-during-FT runs along the same direction but at less magnitude; that’s why it preserves general capability without fixing refusal. Only recipes that explicitly include a refusal signal in the loss (lvis_replay_neg, presence_only) flip the bias direction. Geometry is destiny here.” — The post’s mech-interp punchline

One sanity check: the scalar bias on the deeper presence head MLP (transformer.decoder.presence_token_head.layers.2.bias) gives a perfectly degenerate test, since it’s a single number. All five trajectory cosines collapse to ±1. We confirm the same pattern: frozen_te, lvis_replay, lvis_replay_neg, and recovery all push it with the same sign; presence_only is the only recipe that flips it. The bigger 256-d bias gives the continuous version; the scalar bias gives the discrete confirmation.

The mech-interp chase that went sideways

Before we landed on retraining the 132 k presence head, we spent considerable effort doing standard mech-interp on the lvis_replay ↔ lvis_replay_neg pair (the controlled comparison where everything but the negatives stream is held constant). The story below isn’t a list of mistakes — every probe is a reasonable thing to try, and three of them returned numbers that looked like clean causal evidence for a specific localised circuit. The interventions that followed all returned null. The methodological lesson is in why.

Four probes converged on `decoder[4].h0`

1. Layer-level cosine probe. For every layer in the fusion + decoder stack, we extracted the presence-token slot (tgt[:, 0, :], “q0”) on a 42-negative SA-Co/Gold probe and measured cos((replay_neg.q0 − replay.q0), (base.q0 − replay.q0)), which asks “does the negatives-FT’s shift point back toward base?”. The signal sharpens monotonically through the decoder and peaks at L3-L5 (cos toward base = 0.61-0.65).
2. Per-head magnitude ranking. We patched nn.MultiheadAttention’s fused fast-path to capture per-head contributions, then ranked the 24 cross-text-attention heads in decoder[3..5] by ‖replay_neg.h_contrib − replay.h_contrib‖₂ on negatives. decoder[4].h0 ranked 1st with ‖Δ‖ = 1.194 and cos-toward-base = +0.993. The replay_neg checkpoint silences this head’s magnitude on negatives from 1.32 to 0.15, close to base’s 0.21. Looks like a named ringleader.
3. Causal patch test. We transplanted decoder[5]’s output from lvis_replay_neg into lvis_replay on the 34 negative pairs where the two checkpoints disagree. Patching decoder[5] alone silenced 82 % of false positives. Adding earlier decoder layers didn’t increase the silencing rate. This looks like rock-solid causal evidence that decoder[5]’s output is the locus.
4. Sparse autoencoder on decoder[5].q0. We trained a 256 → 1024 SAE (ReLU + L1, column-normalised decoder) on 418 SA-Co/Gold pairs and ranked features by the (mean fire on neg − pos) difference between lvis_replay_neg and lvis_replay. This identified three “base-recovery” features (949, 572, 403) that fire ~100 % on base negatives, drop to ~75 % in plain lvis_replay, and recover to ~95 % in lvis_replay_neg. A clean feature-level basis for what was thought to be the suppression circuit.
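The first two probes share one scoring rule: how far did the negatives stream move a vector, and does that movement point back toward base? A minimal sketch of that rule, with 2-d toy vectors in place of real per-head contributions; `rank_heads` and the head names are hypothetical, and this is not the code in sam3_negative_head_per_head.py.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _cos(a, b):
    return sum(x * y for x, y in zip(a, b)) / (_norm(a) * _norm(b))

def rank_heads(contribs):
    """contribs maps head name -> (base, replay, replay_neg) contribution vectors.

    Returns [(name, ||replay_neg - replay||, cos toward base)],
    sorted so the head that moved most comes first.
    """
    rows = []
    for name, (base, replay, replay_neg) in contribs.items():
        shift = _sub(replay_neg, replay)   # what the negatives stream changed
        toward_base = _sub(base, replay)   # what a full restoration would change
        rows.append((name, _norm(shift), _cos(shift, toward_base)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Toy data: head_a is loud under replay and quiet again under replay_neg;
# head_b moves a little, in a direction unrelated to base.
toy = {
    "head_a": ([0.2, 0.0], [1.3, 0.1], [0.15, 0.0]),
    "head_b": ([0.5, 0.5], [0.6, 0.5], [0.6, 0.6]),
}
ranking = rank_heads(toy)
for name, shift_norm, cos_base in ranking:
    print(f"{name}: shift={shift_norm:.3f} cos_toward_base={cos_base:+.3f}")
```

A ranking like this says which components co-vary with the behavioural change. On its own it says nothing about whether editing them would change the behaviour.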

All four probes pointed at the same place: a small set of cross-text-attention heads in decoder[4-5], with decoder[4].h0 as the leading mover, and the decision crystallising at decoder[5]’s output. That is exactly the situation in which a published mech-interp paper would write “we localised the recovery to the following named heads” and propose an intervention.

Four training-free interventions returned null

Table 8.1 — Training-free interventions on the ranked heads / activations. neg_correct measured on the full SA-Co/Gold sweep (7 subsets × 2 000 pairs).
Intervention	What it does	neg_correct
`lvis_replay` (starting point)	—	2.90 %
`surgery_h0` (α=1.0)	Replace `decoder[4].h0` in/out-proj rows with base values	2.90 %
`surgery_top4` (α=1.0)	Replace decoder[4].h0, h1, h2, decoder[5].h2 with base values	2.84 %
`surgery_h0_alpha05`	0.5·base + 0.5·lvis_replay on the same head slices	2.89 %
Activation steering at `d[5].q0`	Add α·(replay_neg.q0 − replay.q0) at decoder[5] output. α ∈ {0.5, 1.0, 1.5, 2.0, 3.0}.	0 – 2.4 %

Every intervention designed from the four-probe synthesis returned a null result. Activation steering at α > 0.5 actually decreased neg_correct further, even though by construction the steering direction was the exact activation-space direction the negatives-FT had travelled.
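The surgery rows depend on knowing which weight slices belong to a single head. The sketch below shows that indexing under stated assumptions: the layer is a standard torch.nn.MultiheadAttention with packed projections (key and value dims equal to embed_dim), embed_dim is 256 and each layer has 8 heads (inferred from the 256-d q0 and the 24 heads across decoder[3..5]). `head_slices` and `blend_rows` are hypothetical helpers, plain lists stand in for tensors, and this is not the code in sam3_head_surgery.py.

```python
def head_slices(embed_dim, num_heads, head):
    """Index ranges owned by one head in packed MultiheadAttention weights.

    in_proj_weight has shape (3 * embed_dim, embed_dim), with the q, k and v
    projections stacked in that order, so a head owns one row block in each.
    out_proj.weight has shape (embed_dim, embed_dim) and reads the concatenated
    head outputs on its input side, so a head owns a block of columns there.
    """
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    lo, hi = head * head_dim, (head + 1) * head_dim
    in_proj_rows = [range(block * embed_dim + lo, block * embed_dim + hi)
                    for block in range(3)]  # q, k, v
    out_proj_cols = range(lo, hi)
    return in_proj_rows, out_proj_cols

def blend_rows(ft_weight, base_weight, rows, alpha):
    """Copy ft_weight, setting the selected rows to alpha*base + (1-alpha)*ft.

    alpha = 1.0 restores the base values; alpha = 0.5 is the halfway blend.
    """
    out = [list(row) for row in ft_weight]
    for i in rows:
        out[i] = [alpha * b + (1.0 - alpha) * f
                  for b, f in zip(base_weight[i], ft_weight[i])]
    return out

in_rows, out_cols = head_slices(embed_dim=256, num_heads=8, head=0)
print([(r.start, r.stop) for r in in_rows], (out_cols.start, out_cols.stop))

# Tiny 2x2 example of the alpha = 0.5 blend on row 0 only.
blended = blend_rows([[2.0, 2.0], [4.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]], [0], 0.5)
print(blended)
```

The out-projection slice would be blended the same way along columns instead of rows.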

What the probes were measuring

The fix is downstream of every probe. The presence-token head (132 k params, three linears + a LayerNorm at decoder[5]’s output) reads q0 and maps it to a single logit. The watch FT broke this MLP into a constant-near-+∞ function: it now maps any plausible q0 to σ ≈ 1. Stages 1, 2, and 4 of the probe measured the upstream correlate of the fix: the activation differences between lvis_replay and lvis_replay_neg live in q0’s neighbours, the cross-attention contributions that feed q0, and the SAE features that decompose q0. All of those are real activation-level differences. But the gate is the MLP after q0, and the MLP’s output is mostly invariant to its input when the FT has driven it into saturation. Changing the input doesn’t change the output.
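Why steering the input does nothing to a saturated gate is easy to see on a toy readout. The sketch below replaces the real three-layer head with a single linear layer plus sigmoid and models saturation as a large logit offset. That is an assumption about the form of the saturation, chosen only to illustrate the insensitivity; the weights and vectors are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def presence_prob(q0, w, b):
    """Toy stand-in for the presence head: linear readout of q0, then sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, q0)) + b)

w = [0.5, -0.3, 0.2]
q0 = [0.4, 0.1, -0.2]
steer = [-1.0, 1.0, -1.0]  # an input shift that lowers the logit by exactly 1.0
q0_steered = [x + s for x, s in zip(q0, steer)]

changes = {}
for name, bias in [("healthy", 0.0), ("saturated", 12.0)]:
    before = presence_prob(q0, w, bias)
    after = presence_prob(q0_steered, w, bias)
    changes[name] = before - after
    print(f"{name:9s} before={before:.6f} after={after:.6f} change={changes[name]:.6f}")
```

The same one-unit logit shift moves the healthy gate by more than 0.2 in probability and the saturated gate by about a hundred-thousandth, because the sigmoid's slope σ(1 − σ) vanishes as σ approaches 1.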

Stage 3 (the causal patch) was the one ambiguous probe. Patching decoder[5]’s output worked — but it replaced both the cross-attention contribution that Stage 2 ranked AND the q0 slot the MLP reads, in lockstep. The naïve reading “cross-attention is the cause” was wrong; the correct reading “q0 is the gate input” was hidden by lockstep replacement.

Two transferable lessons

Lesson 1 — Replay during training is geometrically different from recovery training afterwards

On the same data, the two regimes produce different weight trajectories: replay walks along the FT’s break direction at reduced magnitude (preserving general capability without fixing the broken decision); continuation training walks further along the same broken direction (degrading general capability for the same parameters). The geometric statement is that the continuation-training regime sees gradients in the same basin as the original FT’s end-of-training gradients, so it walks the same path; the during-FT regime sees a mixed-distribution loss the whole time, and never enters the watch-only basin in the first place. If you can choose, regularise during training. Post-hoc continuation training cannot recover what replay would have prevented.

Lesson 2 — Activation localisation can be downstream-blind

Four reasonable mech-interp probes converged on the same activation-level circuit (decoder[4].h0 etc.) and four training-free interventions on that circuit returned null. The reason: the actual gating mechanism was a 132 k MLP downstream of every probe, and its output had saturated under the FT into a constant-+∞ function whose response is insensitive to input. Activation-level magnitude rankings find heads that move with a behavioural delta; they don’t necessarily find heads that cause it. The cheapest causal test in a model with a known gating circuit is to read its output directly at inference time — before the SAE, before the per-head ranking. A 132 k-param MLP can be the single point of failure of an 840 M-param open-vocab model.
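Reading the gate directly can be as simple as comparing its output on prompts known to be absent against prompts known to be present. A minimal sketch of such a readout, assuming presence logits have already been collected per prompt and that a 0.5 probability threshold defines refusal; `gate_report`, the threshold, and the toy logits are illustrative assumptions, not the SA-Co/Gold scoring code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_report(neg_logits, pos_logits, threshold=0.5):
    """Summarise a presence gate from its logits on absent and present prompts."""
    neg_p = [sigmoid(z) for z in neg_logits]
    pos_p = [sigmoid(z) for z in pos_logits]
    return {
        # Share of absent-concept prompts the gate correctly refuses.
        "neg_correct": sum(p < threshold for p in neg_p) / len(neg_p),
        # Share of present-concept prompts the gate lets through.
        "pos_pass": sum(p >= threshold for p in pos_p) / len(pos_p),
        # Mean probability gap; near zero means the gate ignores its input.
        "mean_gap": sum(pos_p) / len(pos_p) - sum(neg_p) / len(neg_p),
    }

# Toy logits for a saturated gate: everything sits far on the "present" side.
report = gate_report(neg_logits=[9.0, 11.0, 8.5], pos_logits=[10.0, 12.0])
print(report)
```

A gate that passes every positive, refuses no negative, and shows a near-zero gap is saturated, and that can be established before any per-head ranking or SAE training.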

“Open-vocab catastrophic forgetting on a single-task FT isn’t one failure mode; it’s the superposition of several. Mask quality on positives, OOD activation geometry, and refusal calibration all break along different axes and respond to different interventions. Reading them as a single ‘CF’ number hides which axis each recipe fixes. The geometry tells you which recipes can compose with each other and which can’t.” — The methodological point

Notes & references

1. SA-Co/Gold numbers are on a 14 000-pair sweep (7 subsets × 2 000 pairs, seed = 0). LVIS-val numbers are on the canonical 30 k-image subset (seed = 0). Watch-IoU numbers are on the 180-task watch-component test split from Post 01.

2. presence_only is forked from lvis_replay’s checkpoint, not from the plain frozen-TE FT. The 162-s framing is relative to the lvis_replay base; an apples-to-apples comparison forking off frozen-TE directly hasn’t been run.

3. Activation-drift probe: 20 watch images × 97 prompts × 3 layers × 9 checkpoints. Source: scripts/sam3_ood_concept_drift.py + sam3_ood_concept_drift_recovery.py. Data dump at results/cross_domain/ood_concept_drift/probe_data.pt.

4. Weight-diff cosines: scripts/sam3_presence_bias_cosines.py. Data dump at results/cross_domain/presence_bias_cosines.json.

5. The failed probes and interventions in §8 are documented in the SAM3 frozen-TE research report (sections on the cross-attention / SAE probes, the weight-surgery experiments, and the activation-steering sweep). Source scripts: scripts/sam3_negative_probe.py, sam3_negative_head_per_head.py, sam3_negative_causal_patch.py, sam3_negative_sae.py, sam3_head_surgery.py, sam3_steering_vector.py.
