aj research · journal
Issue 03 · Nov 2025
Mechanistic Interpretability · SAM3 Series · Post 01

How a vision transformer learns a new task — what we found inside SAM3

We fine-tuned Meta’s SAM3 — an 840M-parameter promptable segmentation model — for a 37-class watch-component task and asked how parameter-efficient we could get. Mechanistic analysis localised the task to five fusion layers; a Marchenko-Pastur rank readout from a partial-FT SVD prescribed a heterogeneous LoRA that beat the full FT (0.9232 vs 0.9215 watch IoU) at 1.5 % of the parameter budget. Three findings generalise to other LoRA work: α wants to be much higher than rsLoRA’s literature default; the MP-filtered SVD rank converges in ~25 % of one epoch (future tasks can skip the upstream full FT); AdaLoRA’s slow warm-up disappears with a one-line init fix. We also catalogue the catastrophic out-of-domain damage these fine-tunes inflict on the base model, and the recipes that partially restore it.

[Cover figure: low-rank signature of ΔW across the fusion encoder F[0…5]; F[4] is starred at 97.7 %.]
In-domain IoU: 0.23 → 0.92 (base SAM3 → frozen-TE fine-tune)
SA-Co neg-correct: 95.8 % → 2–3 % (open-vocab refusal collapse)
Fusion-only budget: 7.9 M (0.94 % of the model · 92 % of the gap)
Full-model LoRA: 12.5 M (1.5 % of the model · MP-rank + 4× α · 0.9232 IoU)

We took Meta’s SAM3 — an ~840-million-parameter promptable segmentation model — and fine-tuned it to segment 37 watch components (bezel, dial, crown, lugs, clasp, pusher, rehaut, and so on). Watch-component test IoU rose from 0.23 on base SAM3 to 0.92 after fine-tuning. By every conventional measure that’s a successful single-task fine-tune.

For an 840 M-parameter model, “did it work” is the easy question — you can answer it from a single number on a test set. The harder, more interesting question is how it worked. The same topline IoU can come from very different internal changes — some surgical and well-localised, some sprawling, some apparently healthy in-domain but destructive everywhere else. This post is about prising those apart on a single concrete fine-tune, and using the results as a lens on how transformers learn new tasks in general.

We ran three recipes. The first updated every parameter (vision encoder, text encoder, fusion stack, decoder): the all-trainable fine-tune. The second was identical except that the text encoder’s 86 M parameters were held byte-identical to base: the frozen-TE fine-tune. The third, motivated by what we’ll find in §3 about where the task actually lands, trained only the first five fusion-encoder layers (7.9 M parameters, 0.94 % of the model), with vision, text, the last fusion layer, and the entire decoder held byte-identical to base. We’ll call it the fusion-only fine-tune. All three land within roughly five points of each other on topline IoU; on the inside they tell very different stories about which of those 840 M parameters were actually load-bearing for the watch task. And all three, at the end of this post, fail in roughly the same way out of domain, which is where the next post picks up.

Quick reference · skip if you already know these
IoU
Intersection over union of predicted and ground-truth mask. 0 = no overlap, 1 = perfect.
cgF1
SA-Co/Gold’s primary open-vocabulary metric: F1 grouped by concept across a 1.4 k-image benchmark. Penalises false-positive masks on prompts the image doesn’t contain.
CKA
Centered kernel alignment — 0–1 similarity between two activation spaces; invariant to rotation and scale, sensitive to representational geometry. Used for comparing checkpoints.
Activation patching
Replace one model’s mid-forward tensor with another model’s, then measure downstream behaviour. The standard mechanistic-interp causal probe.
Effective rank
For a weight-delta matrix, the number of singular directions needed to capture 90 % of its energy, normalised by the full dimension. Lower = the update lives in a smaller subspace.
Mask-lens probe
A tiny prompt-conditioned bilinear probe trained at each vision block to predict the answer mask. Probe IoU vs. block index tells you where the network first commits.
01

The weights barely moved — and they moved in a sliver of the available subspace

The first question on any mechanistic-interpretability run is deceptively simple: which weights changed, and by how much? SAM3’s parameters break into module groups — vision trunk, text encoder, fusion encoder, transformer decoder, attention sub-blocks, and the small heads that produce masks and scores. For each parameter tensor we have the base value and the fine-tuned value; the weight-delta is just their difference. Two summary statistics matter: the relative L2 change (how much did this tensor move?), and the effective rank of the delta (in how many directions did it move?). A 256×256 delta of rank 25 isn’t really a 65 k-knob update; it’s a 25-direction update wearing a high-dimensional coat.
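The two summary statistics can be sketched in a few lines of NumPy. This is a minimal illustration, not the analysis code behind the figures: it assumes "energy" means the sum of squared singular values, and the matrices are random stand-ins for a real base / fine-tuned tensor pair.

```python
import numpy as np

def relative_l2_change(w_base, w_ft):
    """||W_ft - W_base||_F / ||W_base||_F for one parameter tensor."""
    return float(np.linalg.norm(w_ft - w_base) / np.linalg.norm(w_base))

def effective_rank(delta, energy=0.90):
    """Fraction of full rank needed to capture `energy` of the squared singular values."""
    s = np.linalg.svd(delta, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy) + 1)   # smallest k with cum[k-1] >= energy
    return k / min(delta.shape)

rng = np.random.default_rng(0)
w_base = rng.standard_normal((256, 256))
# Hypothetical rank-25 update, mirroring the 256x256 example in the text.
delta = 0.01 * rng.standard_normal((256, 25)) @ rng.standard_normal((25, 256))
w_ft = w_base + delta

print(relative_l2_change(w_base, w_ft))
print(effective_rank(w_ft - w_base))
```

For the rank-25 toy delta the effective rank can be at most 25/256, however many entries of the matrix changed.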

For the frozen-TE fine-tune, the picture is striking: every parameter group that moved at all used only roughly 10–40 % of its available rank. The smallest-rank update is dot_prod_scoring (the head that turns prompt×image into match logits) at 9.6 % effective rank. The largest is the transformer decoder at 39.9 %. Nothing moves in a full-rank way.

FIG. 1.1 · frozen-TE recipe

Effective rank of the weight delta, by parameter group

| Parameter group | Mean effective rank (90 % energy), % of full rank |
|---|---|
| transformer_decoder | 39.9 % |
| decoder_cross_attn_image | 31.3 % |
| fusion_encoder | 31.0 % |
| geometry_encoder | 27.1 % |
| decoder_ca_text | 18.5 % |
| decoder_self_attn | 17.8 % |
| mask_predictor | 17.0 % |
| cross_attend_prompt | 13.6 % |
| vision_trunk | 12.4 % |
| dot_prod_scoring | 9.6 % |
Effective rank of the weight-delta per parameter group (frozen-TE fine-tune), as a percentage of full rank. Modules with no entry were either frozen (text encoder, text resizer) or non-matrix parameters where rank is not defined. Read it as: in this module, the update lived in a subspace this fraction the size of what it could have used.

The vision trunk (447 M parameters, the bulk of the model) moved with a weighted relative change of 1.5 % and an effective rank around 12 %. That’s a sub-2 % update along a thin subspace across hundreds of millions of weights. The decoder and fusion-stack deltas are larger in magnitude but still rank-constrained. Combined with the cosine-similarity-to-base column from the same artifact (every group above 0.99), this is the weight-space signature of a well-behaved fine-tune: small, structured, and far from the kind of wholesale rewrite that a topline IoU jump of roughly 0.7 (0.23 → 0.92) might suggest.

Low-rank weight deltas tell us the update is structured. They do not tell us where in the model the change matters. For that, we switch from looking at parameters to looking at activations.

02

Localising the task — the modular swap matrix

SAM3’s parameters group cleanly into four functional modules. Call them T (the text encoder, ~86 M parameters), V (the vision encoder, ~639 M), F (the fusion encoder, ~30 M, where text and vision tokens interact), and D (the decoder plus segmentation and scoring heads, ~85 M). For any subset of these modules, we can build a hybrid checkpoint by copying that subset’s weights from the fine-tune and the remainder from base — a Frankenstein state-dict, one half each.

Eight such hybrids form a swap matrix. Each cell tells us how much of the IoU gap that module’s parameters carry on their own, with every other parameter held at base. If one module accounted for the entire fine-tune, that module’s row would hit 100 %; if no module did, every row would stay near 0.
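A hybrid checkpoint of this kind can be sketched as a dictionary merge. The key prefixes below are hypothetical (real SAM3 checkpoint keys may differ, and D in particular spans the decoder plus several heads), and the integers stand in for tensors.

```python
def build_hybrid(base_sd, ft_sd, take_from_ft, prefixes):
    """Take tensors under the selected modules' prefixes from the fine-tune, the rest from base."""
    chosen = tuple(prefixes[m] for m in take_from_ft)
    return {k: (ft_sd[k] if chosen and k.startswith(chosen) else base_sd[k]) for k in base_sd}

def gap_recovered(iou, base_iou, ft_iou):
    """Share of the base -> fine-tune IoU gap that a hybrid recovers."""
    return (iou - base_iou) / (ft_iou - base_iou)

# Hypothetical key prefixes, for illustration only.
PREFIXES = {"T": "text_encoder.", "V": "vision_encoder.", "F": "fusion_encoder.", "D": "decoder."}

base_sd = {"text_encoder.w": 0, "vision_encoder.w": 0, "fusion_encoder.w": 0, "decoder.w": 0}
ft_sd = {k: 1 for k in base_sd}

cell_h = build_hybrid(base_sd, ft_sd, take_from_ft="FD", prefixes=PREFIXES)
print(cell_h)
print(round(100 * gap_recovered(0.70, 0.23, 0.92)))  # cell H, from the rounded IoUs in Table 2.1
```

The percentages in the table may differ by a point from what this function gives on the rounded IoUs, since the table was presumably computed from unrounded values.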

Table 2.1 — Swap matrix on the frozen-TE recipe
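| Cell | Description | Modules taken from FT | IoU | % of gap |
|---|---|---|---|---|
| A | baseline | nothing (all base) | 0.23 | 0 % |
| B | topline | everything (all FT) | 0.92 | 100 % |
| C | text only | T | 0.23 | 0 % |
| D | vision only | V | 0.52 | 43 % |
| E | fusion only | F | 0.60 | 54 % |
| F | decoder only | D | 0.47 | 34 % |
| G | encoders | V + F | 0.52 | 43 % |
| H | downstream | F + D | 0.70 | 68 % |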

Three things to read off. Cell C: the text encoder’s parameters carry 0 % of the gap on their own. Whatever the text encoder learned during the all-trainable fine-tune, those rotations alone don’t move the needle on watch-component IoU — foreshadowing the next section. Cells D, E, F: no single module reaches the topline; each carries 34–54 %. The fine-tune is distributed across vision, fusion, and decoder. Cell H: handing the fine-tune’s fusion and decoder weights to the base vision encoder recovers 68 % of the gap. That is the cleanest single number for “what the fusion plus decoder weight changes are worth on their own” — and the number we’ll come back to in §3 when we compare it against the activation-level probe.

This is useful but coarse. Module groups have many layers each. The next probe sharpens both the layer resolution and the question of what, exactly, fits through which point in the network.

03

The 256-dim residual at fusion[4] is the compatibility interface

To say anything sharper than “the fusion+decoder weights are worth 68 %” we need to know what is actually being passed between modules. Before the probe, the architecture.

FIG. 3.1 · text · decoder · vision

SAM3 architecture: three residual / memory paths

[Diagram: the text residual runs from T through the fusion encoder F[0] … F[5], with a ★ marking the 256-d residual patch point; the decoder query Q₀ runs through D[0] … D[5] to the masks; the vision memory bus (V) is consulted at every layer, and the text memory bus feeds the decoder.]
SAM3 has three residual / memory paths. The text residual flows left-to-right through F[0..5] — this is what cross-attends to image features at every fusion layer. The decoder is a separate stack of six layers D[0..5] whose initial input Q₀ is a learnable embedding (not a continuation of the fusion text residual). Each D[i] cross-attends back to two memories: the fusion text memory bus (F[5]’s final output) and the vision memory bus (the vision_neck output). The vision memory bus is also consulted by every fusion layer. The activation-patching experiment overwrites a single 256-dim tensor mid-forward — at fusion[4] it swaps the text residual at that depth; at decoder[i] it swaps that decoder layer’s output. The two memory buses are never themselves patched, which is why each subsequent layer still cross-attends back to base-shaped memory.

Two consequences of this layout matter for what comes next. First, the fusion encoder operates in a 256-dim bottleneck — text tokens enter natively at 1024 dimensions and are projected down by text_resizer, while the vision encoder’s output is projected from its native width down to 256 by vision_neck. Fusion is the narrow channel where text and vision are first forced to share a representation; that bottleneck width is what makes the “single tensor” framing later in this section quantitatively meaningful. Second, only the text path is a residual stream — the vision features sit beside it as a constant memory consulted at every layer. We can patch the text-side residual at any depth; the vision memory bus keeps doing what it would have done.

Now the probe. Activation patching swaps a single tensor mid-forward, not weights. Run base and fine-tune on the same image; at some named point on the text residual stream, replace base’s activation with the fine-tune’s; let base’s downstream weights continue. Measure the resulting IoU. This tests whether the information needed to do the task is encodable at that layer’s 256-dim residual in a format base’s downstream can read.
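The mechanics of the probe can be illustrated on a toy residual stack. This is a sketch under strong assumptions, not SAM3: each "layer" is a single tanh residual block, the toy fine-tune differs from base only in its first five layers, and the final layer is shared, so here the patch recovers the fine-tune's output exactly by construction.

```python
import numpy as np

def run(layers, x, patch_at=None, patch_value=None):
    """Run a residual stack; optionally overwrite the output of layer `patch_at`."""
    acts = []
    for i, w in enumerate(layers):
        x = x + np.tanh(x @ w)           # toy residual block, stands in for a fusion layer
        if i == patch_at:
            x = patch_value              # overwrite the residual with the donor's activation
        acts.append(x)
    return x, acts

rng = np.random.default_rng(0)
d, n_layers = 256, 6
base = [0.05 * rng.standard_normal((d, d)) for _ in range(n_layers)]
# Toy "fine-tune": the first five layers differ from base, the last is shared.
ft = [w + 0.05 * rng.standard_normal((d, d)) for w in base[:5]] + [base[5]]

x = rng.standard_normal((4, d))          # 4 text tokens in a 256-dim residual
ft_out, ft_acts = run(ft, x)
base_out, _ = run(base, x)
patched_out, _ = run(base, x, patch_at=4, patch_value=ft_acts[4])

print(np.allclose(patched_out, ft_out))  # base downstream reading the donor's residual
print(np.allclose(base_out, ft_out))     # unpatched base, for comparison
```

In a real model the downstream weights are not shared, so the interesting quantity is how much of the fine-tune's IoU survives the hand-off. In a PyTorch model the overwrite would typically be done with a forward hook on the chosen layer.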

FIG. 3.2 · base + FT-activation, frozen-TE recipe

Per-layer IoU recovery from activation patching

[Bar chart: patched IoU ÷ FT IoU (0.00–1.00) per patch point. Fusion encoder: F[0] 60 %, F[1] 90 %, F[2] 97 %, F[3]–F[5] plateau at 97.5–97.8 %. Transformer decoder: D[0]–D[5] at 0 % at every layer (decoder co-adapted).]
Each green bar is the watch-test IoU after replacing base’s text-residual activation at that layer with the fine-tune’s, measured on a stratified probe set of 40 tasks where base scores exactly 0.0 (so the patched IoU is what the layer’s activation can recover on its own). FT IoU on the same subset is 0.98. Recovery rises through the fusion stack and saturates by fusion[3]. The decoder bars are at exactly 0.0 — patching FT’s decoder activation into base breaks the prediction entirely (the base mask predictor cannot read an FT-shaped decoder activation), which is the activation-level signature of the decoder co-adaptation we’ll return to in §5.

Recovery rises steeply through early fusion and saturates by fusion[3]: 60 % of FT IoU at F[0], 90 % at F[1], 97 % at F[2], 97.5 % at F[3], 97.7 % at F[4], 97.8 % at F[5]. The plateau says F[3], F[4], and F[5] are all roughly equivalent vantage points — fusion[4] isn’t a uniquely special layer, it’s simply the depth at which the cumulative summary has fully formed. We use F[4] as the reference because by then we are comfortably on the plateau.

Now the careful reading — and this is where it’s easy to overclaim. The 97.7 % figure is the recovery when base’s text residual is overwritten with the fine-tune’s. But the tensor we’re injecting was computed by the fine-tune’s vision encoder, text encoder, and four fusion layers running in a parallel forward pass. So part of that 97.7 % is the fusion+decoder weight changes from §2 — and part of it is V_FT’s contribution riding in on the patched activation. Quantitatively:

Table 3.3 — Two probes at the same point in the network
| Probe | What it measures | % of gap |
|---|---|---|
| Cell H, parameter swap | base V + base T + FT fusion + FT decoder weights | 68 % |
| Activation patch at fusion[4] | all-base weights, but text residual at F[4] overwritten with FT’s, which encodes V_FT and F_FT[0..3] | 97.7 % |
Section 3 takeaway
The 256-dim residual at fusion[4] is base-decoder compatible: it carries the fine-tune’s full upstream re-representation, and base’s frozen downstream still decodes it correctly.

F+D’s own weight changes on top of base vision recover 68 % of the gap. Activation patching at fusion[4] recovers 97.7 %; the extra ~30 points is V_FT’s contribution arriving via the patched activation. The meaningful claim isn’t that fusion[4] “crystallises” the task in its parameters; the meaningful claim is that the 256-dim residual at that depth is wide enough to fit the FT’s entire upstream re-representation, and that base’s untouched F[5] + decoder + heads can still read it. That holds across both recipes.

The reverse direction is sharply asymmetric. Patch base’s activation into the fine-tune’s pipeline at any decoder layer and IoU drops to zero. The decoder has co-adapted to whatever the fine-tune’s encoder produces; replace the encoder’s output with base’s and it can’t decode. We return to that asymmetry in §5.

04

The text-encoder rotation was mostly unnecessary

The all-trainable fine-tune updated the text encoder along with everything else. Its 86 M parameters moved, and the embeddings of the 37 watch concepts rotated — some a lot, some barely.

FIG. 4.1 · cosine to base × shift magnitude

Per-concept text-embedding shift under the all-trainable fine-tune

[Scatter plot: x-axis, cosine similarity between base and fine-tuned embedding (0.2–1.0); y-axis, relative shift magnitude (0.00–1.50). Labelled concepts: rehaut, lunette, sealing, pusher, lugs, clockwork, metal strap, metal dial, dial, bezel, strap, leather, pearl.]
X-axis: cosine similarity between the base and fine-tune embeddings for the same concept (rightward = less rotated). Y-axis: relative shift magnitude. Concepts at top-left (rehaut, lunette, sealing) rotated most; the photographic terms at bottom-right barely moved.

The interesting question is whether any of this rotation was load-bearing. So we re-ran the recipe with one change: the text encoder’s gradient flow disabled, its parameters held byte-identical to base. Same vision, same fusion, same decoder, same hyperparameters — only the text path frozen.

Table 4.2 — Freezing the text encoder is strictly better
| Recipe | Watch IoU | Watch Dice | SA-Co/Gold cgF1 |
|---|---|---|---|
| all-trainable fine-tune | 0.916 | 94.4 | 4.88 |
| frozen-TE fine-tune | 0.922 | 94.8 | 6.73 |

Freezing the text encoder is strictly better on every metric we measured. In-domain IoU is +0.006, Dice is +0.4, and the out-of-domain open-vocabulary cgF1 is +1.86. Whatever the all-trainable text-encoder rotation was contributing in-domain, the rest of the model reconstructs the equivalent behaviour from frozen text vectors.

What was the rotation doing? The one mechanism it produced that downstream layers couldn’t reproduce is compound-noun separation: in base SAM3, strap and metal strap have text-embedding cosine 0.71; in the all-trainable fine-tune that cosine drops to 0.30. But the frozen-TE fine-tune matches its watch IoU without ever doing this separation — the downstream layers achieve the same disambiguation a different way. The TE rotation was a redundant mechanism for an in-domain problem.

The stronger evidence that the rest of the model picks up the slack comes from a small probe at each vision-encoder block: a prompt-conditioned bilinear head trained to predict the answer mask from that block’s activations. The probe’s IoU as a function of block index tells us where the network first commits to its answer.

FIG. 4.3 · where the network first commits

Mask-lens probe IoU vs. vision-encoder block index

[Line chart: x-axis, vision-encoder block index (0–30); y-axis, mask-lens probe IoU (0.55–0.85). Curves: frozen-TE (peak 0.836 at block 23), all-trainable (peak 0.809 at block 29), base.]
Base (dashed) is essentially flat — the vision encoder isn’t doing watch-component segmentation on its own. The all-trainable fine-tune peaks late, at block 29. The frozen-TE fine-tune peaks earlier, at block 23, and at a higher value (0.836). Freezing the text path frees the vision path to commit sooner.

Freezing the text encoder lets the vision encoder commit to its answer six blocks earlier (block 23 vs block 29) and at a higher peak probe IoU (0.836 vs 0.809). Six blocks is a modest shift, but it is consistent with the rest of the picture: the all-trainable recipe spends some of its parameter budget on text-side rotation; the frozen-TE recipe spends the same gradient signal on vision-side commitment.

A sanity check for the vision encoder: we also train sparse autoencoders on the vision residual stream at blocks 23 and 31 in both base and fine-tune, and ask whether the fine-tune invents new features or just reweights existing ones. The answer is decisive — all 4096 patch features in the fine-tune match a base feature at cosine ≥ 0.94 at both layers. The fine-tune reweights the vision dictionary; it never expands it.
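The matching step can be sketched as a best-cosine lookup between two feature dictionaries. Which SAE vectors the comparison used (for example decoder directions) is not stated above, so treat that as an assumption; the dictionaries here are random toys, with the "fine-tune" built as a permuted, rescaled, lightly perturbed copy of base.

```python
import numpy as np

def max_cosine_match(ft_dict, base_dict):
    """For each fine-tune feature direction (row), the best cosine to any base feature."""
    a = ft_dict / np.linalg.norm(ft_dict, axis=1, keepdims=True)
    b = base_dict / np.linalg.norm(base_dict, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

rng = np.random.default_rng(0)
base_dict = rng.standard_normal((512, 64))     # toy dictionary: 512 features, 64-dim
perm = rng.permutation(512)
# Toy "fine-tune": same directions, reordered and rescaled, plus small noise.
ft_dict = 3.0 * base_dict[perm] + 0.01 * rng.standard_normal((512, 64))

best = max_cosine_match(ft_dict, base_dict)
print(float(best.min()))
print(bool((best >= 0.94).all()))              # every feature has a base counterpart
```

A fine-tune that invented genuinely new features would show up as rows whose best match falls well below the threshold.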

05

The decoder is a co-adapted read-out

Back to the activation patching from §3, in the opposite direction. Patching the fine-tune’s fusion[4] output into base gave base 97.7 % of the FT’s IoU — the information was in the activation, and base’s decoder could read it. What happens if we patch base’s activation into the fine-tune’s pipeline?

Table 5.1 — Reverse patching: base activation into FT pipeline
| Patched layer | FT receiving base activation, IoU |
|---|---|
| fusion[0]–[5] | 0.35–0.92 (degrades smoothly) |
| decoder[0]–[5] | 0.00 at every layer |

At any decoder layer, replacing the fine-tune’s activation with base’s produces zero IoU. The decoder is not interchangeable with itself across checkpoints: its activations only make sense in the company of the encoder it was trained next to. Replace them with base’s and the read-out collapses. This holds even though the per-layer residual write magnitudes look similar between base and fine-tune; what differs is the geometry. Base ↔ FT CKA in the decoder hovers around 0.2, far lower than in the fusion stack.
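For readers who want to reproduce this kind of comparison, here is a minimal sketch of linear CKA. It assumes the linear-kernel variant (the post does not say which kernel was used) and runs on random activations rather than SAM3 tensors.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x) ** 2
    return float(cross / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y)))

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 32))
q, _ = np.linalg.qr(rng.standard_normal((32, 32)))   # random orthogonal matrix

cka_rotated = linear_cka(x, 2.5 * x @ q)                       # same geometry, rotated and rescaled
cka_unrelated = linear_cka(x, rng.standard_normal((500, 32)))  # independent activations
print(cka_rotated, cka_unrelated)
```

The first value illustrates the invariance to rotation and scale noted in the quick reference; the second shows what unrelated geometry looks like at this sample size.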

Mechanistically the decoder isn’t storing separate, portable knowledge of watch components. It’s a learned mapping from “the specific representation style this encoder produces” to a mask. Cheap to retrain when the encoder changes; not portable across encoders — at least, that is the picture when the decoder is one of the things we’re training. §6 sharpens that qualifier.

06

What if we only train the fusion stack?

Everything so far points at the fusion encoder as the place where the watch task lands — activation patching at fusion[4] already recovered 97.7 % of the gap; cell H of the swap matrix put fusion+decoder weights at 68 % even with base vision; the low-rank weight signature concentrated there. That suggests a clean experimental question: what if we hold everything else — vision encoder, text encoder, decoder, every head — byte-identical to base, and only let the fusion stack move?

We trained one more recipe: only the first five fusion layers trainable (F[0..4]), the sixth fusion layer and the entire decoder frozen along with V and T. Trainable budget: 7.9 M parameters, 0.94 % of the model. Same watch data as before, no negatives, no LVIS replay, 3 epochs, 5.2 hours on two RTX 4090s. Call it the fusion-only fine-tune.

Watch-test IoU lands at 0.870, which is 92.5 % of the 0.230 → 0.922 gap, from under one percent of the parameters. That alone is striking: cell E of the swap matrix (when we took the all-trainable / frozen-TE checkpoints’ fusion weights and ran them with base V and base D) only recovered 53.7 % of the gap. Training fusion fresh against frozen base vision and base decoder finds a roughly 40-percentage-point better optimum than the fusion weights from the standard recipes, presumably because the gradient now has to produce a residual that base’s downstream can decode, rather than co-adapting with a moving decoder. With everything else pinned, fusion fits the task by itself, and it fits it cleanly.
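As a quick check of the arithmetic, using only the IoUs and the cell E figure quoted in this section:

```latex
\[
\frac{\mathrm{IoU}_{\text{fusion-only}} - \mathrm{IoU}_{\text{base}}}
     {\mathrm{IoU}_{\text{frozen-TE}} - \mathrm{IoU}_{\text{base}}}
= \frac{0.870 - 0.230}{0.922 - 0.230}
= \frac{0.640}{0.692} \approx 0.925,
\qquad
92.5\,\% - 53.7\,\% = 38.8 \text{ points}.
\]
```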

Section 6 takeaway
Most of the in-domain capacity of a vision-language fine-tune lives in the cross-attention fusion stack, not in the encoders or the decoder.

Re-running the §3 activation-patching probe on the fusion-only checkpoint sharpens the picture of where the task is being built. The recovery curve is the same shape as before — steep climb through fusion, decoder bars at zero — but the early-fusion contribution is much smaller:

FIG. 6.1 · two recipes overlaid

Activation-patching recovery: frozen-TE vs. fusion-only

[Line chart: x-axis, patch point (F[0]–F[5], D[0]–D[5]); y-axis, patched IoU as % of FT (0–100 %). Curves: frozen-TE and fusion-only. Annotations at F[0]: frozen-TE 60 % (V_FT arrives via cross-attn); fusion-only 2 % (unchanged image features).]
Each curve is the patched IoU as a percentage of that recipe’s FT baseline on the activation-patching probe set (where base = 0). In the frozen-TE recipe (amber), patching just F[0] already recovers 60 % of FT IoU because V_FT’s task-shaped image features arrive into F[0]’s cross-attention. In the fusion-only recipe (green), V is byte-identical to base, so F[0]’s cross-attention reads unchanged image features — recovery drops to 2 %. The task has to be built iteratively by F[1..3] attending to the same image memory from progressively more task-shaped queries. Both curves saturate by F[3].

The contrast cleans up a misleading reading of the original crystallisation story. In frozen-TE, F[0]’s 60 % recovery looked like the first fusion layer doing most of the work; in fact a big chunk of that 60 % is V_FT’s contribution riding into F[0] through cross-attention. Freezing V removes the smuggled-in signal and exposes the actual per-layer contribution: fusion needs three rounds of cross-attention to a fixed image memory to build up the task representation.

The most-moved fusion layer contributed the least

Tracking which fusion layer’s weights moved farthest produces the most counter-intuitive number of the whole run. Across the trainable fusion layers of the fusion-only fine-tune, weight rel-change and activation-patching contribution move in opposite directions. F[0]’s weights moved farthest of any fusion layer (18.9 % of base-norm magnitude) but contributed essentially nothing functionally. F[4]’s weights moved least but contributed all of the gap.

FIG. 6.2 · fusion-only recipe

Weight motion vs. functional contribution, per fusion layer

| Trainable fusion layer | Weight motion (rel-change %) | Functional contribution (% gap recovered) |
|---|---|---|
| F[0] (most moved) | 18.9 | 2 |
| F[1] | 17.0 | 59 |
| F[2] | 14.8 | 92 |
| F[3] | 13.6 | 100 |
| F[4] (all the work) | 12.6 | 100 |
For each trainable fusion layer in the fusion-only fine-tune: weight rel-change (amber, %) and activation-patching contribution (green, % of IoU gap recovered). The two measures move in opposite directions across the stack. F[0]’s weights moved farthest of any fusion layer (18.9 % of base-norm magnitude) but contributed essentially nothing functionally. F[4]’s weights moved least but contributed all of the gap.

How can the layer that moved the most contribute the least? Because the parameter motion at F[0] is largely wandering along a loss-invariant direction that the trainable downstream layers absorb. Pre-LayerNorm residual blocks have a well-known symmetry: rotate one block’s output by R, let a later trainable block silently rotate by R⁻¹, and the loss is unchanged. Inside the trainable fusion stack each early layer has multiple later trainable layers willing to compensate for its drift — the optimiser pays no penalty for wandering, so it does. At F[4], the next layer (F[5]) is frozen; the next layers after it (decoder, heads) are all frozen too. There is no downstream slack to absorb gauge drift; F[4]’s every motion has to be task-relevant or it gets pulled back by the loss.
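The symmetry is easiest to see in a stripped-down linear case. The sketch below is a deliberate simplification: two plain linear maps with no LayerNorm or nonlinearity, and a random orthogonal matrix standing in for R. It shows only that large parameter motion can leave the composed function untouched, not how much of this freedom a real pre-LN block has.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w1 = rng.standard_normal((d, d))                   # "early" layer weights
w2 = rng.standard_normal((d, d))                   # "later" trainable layer weights
r, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal R

x = rng.standard_normal((8, d))
y_before = x @ w1 @ w2

w1_rot = w1 @ r                                    # early layer's output rotated by R
w2_rot = r.T @ w2                                  # later layer undoes it (R^-1 = R^T)
y_after = x @ w1_rot @ w2_rot

rel_change = np.linalg.norm(w1_rot - w1) / np.linalg.norm(w1)
print(rel_change)                                  # large parameter motion in the early layer
print(np.allclose(y_before, y_after))              # yet the end-to-end function is unchanged
```

If the later layer were frozen, as F[5] and the decoder are in the fusion-only recipe, no such compensation would be available and the same motion in the early layer would change the output.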

Three independent measurements support the gauge-drift reading. First, F[0]’s output has CKA 0.93 with base — the function it implements has barely changed, even though the parameters have. Second, F[0]’s residual norm actually shrinks slightly relative to base (6.75 vs 7.27); it is not pushing the residual outward toward a task-shaped geometry. Third, the effective rank of the whole fusion-encoder update drops from 0.31 in frozen-TE to 0.24 here — the smaller-rank number is computed on the change of input-to-output function, not on raw parameter delta, so it excludes null-direction drift. Roughly two-thirds of F[0]’s 18.9 % weight motion lives in functional null-space.

The decoder co-adaptation asymmetry from §5 vanishes

§5 reported that patching base’s decoder activations into the fine-tune at any decoder layer produced IoU = 0 everywhere. The fusion-only recipe lets us check whether that was a property of the decoder’s architecture or of the fact that the decoder had been trained. The decoder is byte-identical to base in this recipe, so the same patching test reads:

Table 6.2 — Reverse patch into decoder, two recipes
| Patched layer | frozen-TE FT | fusion-only FT |
|---|---|---|
| decoder[0] | 0.00 | 0.92 |
| decoder[1] | 0.00 | 0.94 |
| decoder[2] | 0.00 | 0.91 |
| decoder[3] | 0.00 | 0.84 |
| decoder[4] | 0.00 | 0.76 |
| decoder[5] | 0.00 | 0.68 |

The decoder still degrades when fed base’s activations (0.92 → 0.68 across the six decoder layers), but the catastrophic asymmetry disappears entirely — no layer collapses to zero. Which means §5’s “decoder co-adapted to a specific encoder” finding is more precisely a finding about training the decoder, not about the decoder. A base decoder accepts both base activations and fusion-only activations because it hasn’t been pulled toward either; the moment we let it train alongside the encoder, it co-adapts and stops accepting anything other than its own encoder’s output.

We’ll save the latent-space picture across all three recipes for §7 below — one short summary first: the watch-task concept merge we’ll quantify there is essentially unchanged in fusion-only. Even with V, T, and the entire decoder held byte-identical to base, the peripheral-metal cluster collapses at fusion[4] by the same +0.29–+0.43 cosine shift as the standard recipes. The merge is a property of the watch-task gradient acting on fusion, not of anything happening in the modules that fusion-only froze.

07

What does the fine-tune actually change about how concepts relate?

The natural next question is whether the fine-tune changed how the model thinks of its 37 concepts in relation to each other. That sounds like one question; it’s actually three. The model carries a representation of each concept at the text encoder (the CLIP-style 1024-dim embedding for a prompt), at fusion[4] (the mid-stack 256-dim residual after the text token has cross-attended four times to the image), and at the predicted-mask output (which pixels does the model segment when you give it this prompt). The three views can give very different answers, and getting that straight matters more than any specific finding the section reports.

We’ll lead with predicted-mask IoU between concept pairs — for every pair of prompts (A, B), the IoU of the predicted mask for A vs. the predicted mask for B on watch test images. This is the most behaviour-anchored probe of the three: if the model’s masks for “lugs” and “metal” substantially overlap, the model is treating those concepts as the same thing in the only place the user can observe. Activation-space probes don’t guarantee that; the mask does.
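A minimal sketch of the pairwise probe on a single image follows. The masks are 4×4 toys and the convention for two empty masks (IoU = 0) is an assumption; a real run would aggregate over the test images that contain both concepts.

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two boolean masks; defined here as 0.0 when both are empty."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def confusability(masks):
    """masks: dict prompt -> boolean mask on one image. Returns prompts and the pairwise IoU matrix."""
    names = list(masks)
    m = np.array([[mask_iou(masks[p], masks[q]) for q in names] for p in names])
    return names, m

# Toy masks: "lunette" nearly coincides with "bezel"; "dial" is disjoint from both.
bezel = np.zeros((4, 4), bool)
bezel[0, :] = True
lunette = bezel.copy()
lunette[1, 0] = True
dial = np.zeros((4, 4), bool)
dial[2:, :] = True

names, m = confusability({"bezel": bezel, "lunette": lunette, "dial": dial})
print(names)
print(m.round(2))
```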

A methodological note worth surfacing up front: when we originally probed fusion[4] cosines we expected them to be a good proxy for “concept identity”. They aren’t. Across the 666 concept pairs in our 37-concept set, even base SAM3 on real watch images has a mean off-diagonal cosine of 0.90 at fusion[4], with 65 % of pairs already above 0.9. Replacing the watch images with a pure black image pushes that mean to 0.93. In the fine-tunes the manifold is even tighter (frozen-TE on watch averages 0.93, on black 0.97; fusion-only on black averages 0.97). So fusion[4] is a saturated cluster: most concept pairs sit inside a narrow band of cosines regardless of whether the prompts are synonyms or opposites. The Pearson correlation between per-pair fusion[4] cosine and per-pair predicted-mask IoU is only 0.22–0.52 across recipes — these two measures are not measuring the same thing. We use fusion[4] as a complementary probe of relative residual-stream geometry; we don’t treat its absolute cosines as concept-identity claims.
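The saturation statistics above can be computed along these lines. The vectors are synthetic (a large shared component plus small per-concept parts, chosen to mimic a saturated cluster) and the mask IoUs are random placeholders, so the printed numbers say nothing about SAM3 itself.

```python
import numpy as np

def offdiag_pairs(mat):
    """Upper-triangle (i < j) entries of a square matrix, one per unordered pair."""
    i, j = np.triu_indices(mat.shape[0], k=1)
    return mat[i, j]

def cosine_matrix(vecs):
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return v @ v.T

rng = np.random.default_rng(0)
n_concepts, d = 37, 256
# Toy residuals: one shared direction plus small concept-specific noise.
vecs = rng.standard_normal(d) + 0.3 * rng.standard_normal((n_concepts, d))
cos = offdiag_pairs(cosine_matrix(vecs))
pred_iou = rng.uniform(0, 1, size=cos.size)        # placeholder mask IoUs, not real data

pearson_r = float(np.corrcoef(cos, pred_iou)[0, 1])
print(cos.size)                                    # 37 * 36 / 2 unordered pairs
print(float(cos.mean()), float((cos > 0.9).mean()))
print(pearson_r)
```

A high mean with a narrow spread is what makes absolute cosines uninformative here: when nearly every pair sits in the same band, only relative differences carry signal.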

The predicted-mask confusability matrix

For the 37-concept set, cell (A, B) of the matrix below is the IoU between the predicted masks for prompts A and B on watch images that contain both concepts. Most cells are dark — the model produces distinct masks for most prompt pairs. The warm cells are the small subset of pairs where masks actually overlap.

FIG. 7.1 · frozen-TE on watch test set

Predicted-mask confusability: 12 of 37 concepts shown

[Heatmap: rows and columns are bezel, dial, crown, lugs, strap, clasp, pusher, rehaut, lunette, hands, h-marker, case; the colour scale is labelled "confusion rate".]
Predicted-mask IoU between concept-pair prompts for the frozen-TE fine-tune. Most pairs sit at near-zero IoU (distinct masks); a small subset shows substantial overlap (warm cells). The warm cluster around bezel / lunette / rehaut / metal / lugs corresponds to genuinely co-located or co-referring watch components — some are synonyms (bezel / lunette), some share visual region (peripheral metal surfaces). The bright off-diagonal cells at the compound-noun pairs (dial / metal dial, strap / metal strap) show those still produce highly overlapping masks, even though the all-trainable recipe’s text encoder separated them in text-embedding space.

Synonym pairs — the cleanest case

Six concept pairs in our 37-category set are functional synonyms or near-synonyms by domain convention: bezel ↔ lunette, clockwork ↔ movement, diamonds ↔ stones, the singular/plural pair pearl ↔ pearls, leather ↔ strap (where leather is the most common strap material), and hour hand ↔ minute hand (literal synonyms by function within the “hand” family). These are the pairs where we’d expect the model to behaviourally treat the prompts as referring to the same thing, and indeed at the predicted-mask level it mostly does:

Table 7.2 — Synonym pairs across recipes
| Pair | fus[4], base | fus[4], frozen-TE | fus[4], fusion-only | pred_IoU, frozen-TE | pred_IoU, fusion-only |
|---|---|---|---|---|---|
| bezel ↔ lunette | 0.77 | 0.99 | 0.99 | 0.81 | 0.68 |
| leather ↔ strap | 0.97 | 0.98 | 0.97 | 0.92 | 0.74 |
| pearl ↔ pearls | 0.98 | 0.99 | 0.97 | 0.58 | 0.33 |
| clockwork ↔ movement | 0.96 | 0.91 | 0.97 | 0.36 | 0.67 |
| diamonds ↔ stones | 0.98 | 0.98 | 0.98 | 0.42 | 0.33 |
| hour hand ↔ minute hand | 0.98 | 0.98 | 0.99 | 0.16 | 0.17 |

Three readings worth pulling out. First, bezel ↔ lunette is the closest thing to “the model learnt the synonymy” we have: high mask overlap (0.81) and high fusion[4] cosine, with the text-encoder cosine substantially moving in the all-trainable recipe (+0.19) from a moderate base of 0.42. Second, hour hand ↔ minute hand is the exception worth noting: literally synonymous in name and function but pred_IoU is only 0.16–0.36 across recipes, because the model genuinely produces masks at different positions on the watch face (these refer to distinct objects that share a kind). The mask-level probe agrees with everyday meaning here, while the fusion-stack probe would have wrongly suggested they’re indistinguishable. Third, recipe choice substantially changes pred_IoU even for synonyms (e.g. pearl ↔ pearls ranges 0.33–0.99 across our five recipes) — pred_IoU is sensitive to how aggressively a given checkpoint produces masks, which we come back to in the caveats below.

The “peripheral-metal” pairs — not as merged as fusion[4] makes them look

When we first looked at fusion[4] cosines for five canonical pairs (lugs ↔ metal, metal ↔ rehaut, bezel ↔ rehaut, bezel ↔ lugs, dial ↔ lugs), they leapt from 0.49–0.66 in base on watch images to 0.89–0.96 in every fine-tune. The natural reading was “the fine-tune collapsed these five concepts into a single latent cluster”. The predicted-mask probe says something more cautious:

Table 7.3 — Peripheral-metal pairs across recipes
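| Pair | fus[4], base | fus[4], frozen-TE | fus[4], fusion-only | pred_IoU, frozen-TE | pred_IoU, fusion-only |
| --- | --- | --- | --- | --- | --- |
| lugs ↔ metal | 0.49 | 0.92 | 0.92 | 0.04 | 0.20 |
| metal ↔ rehaut | 0.62 | 0.96 | 0.92 | 0.19 | 0.11 |
| bezel ↔ rehaut | 0.66 | 0.96 | 0.95 | 0.22 | 0.06 |
| bezel ↔ lugs | 0.57 | 0.89 | 0.91 | 0.06 | 0.05 |
| dial ↔ lugs | 0.60 | 0.90 | 0.92 | 0.10 | 0.09 |
| crown ↔ pusher | 0.94 | 0.98 | 0.97 | 0.76 | 0.58 |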

For the five “peripheral-metal” pairs listed first, fusion[4] cosines say “heavily merged” (0.89–0.96) while pred_IoU says “distinctly masked” (0.04–0.22 in the frozen-TE recipe). The model has merged the residual stream for these prompts but still produces different mask outputs: the decoder compensates for the residual-stream merge by cross-attending to the image and localising a distinct mask region for each concept. This sharpens our earlier framing. The activation-space “merge” is real, but it doesn’t mean the model behaves as if the concepts are the same; it means the model relies more on visual content (and less on prompt-side residual structure) to discriminate them at the output.

crown ↔ pusher is the genuine exception in this group: pred_IoU is 0.76 in frozen-TE and the fusion[4] cosine is already 0.94 in base (0.98 in frozen-TE), making it the only pair where both probes were already agreeing in base that the masks substantially overlap. These two components sit in nearly the same place on the side of the watch case, so the model masks roughly the same region whichever prompt you give it. That output-level confusion isn’t an artefact of the fine-tune; it’s a base-model property of how watch images look.

Compound nouns — what only text-encoder rotation can do

Two of the 37 categories are compound nouns of others: dial / metal dial and strap / metal strap. They’re the cleanest probe of what the text-encoder rotation in the all-trainable recipe was doing, because the base text encoder treats compound pairs as highly similar (~0.72 cosine) and the watch fine-tune has to disambiguate them for the mask losses.

Table 7.4 — Compound nouns in two views
| Pair | text base | text all-trainable | fus[4] FT (range) | pred_IoU frozen-TE |
| --- | --- | --- | --- | --- |
| dial ↔ metal dial | 0.731 | 0.408 (Δ −0.32) | 0.94–0.95 | 0.59 |
| strap ↔ metal strap | 0.714 | 0.318 (Δ −0.40) | 0.96–0.98 | 0.69 |

Text-encoder rotation in the all-trainable recipe pulls the compound apart by 0.32 and 0.40 cosine — substantial and prompt-specific, and the one mechanism only that recipe produces. But the downstream consequence is modest: predicted masks still substantially overlap (pred_IoU 0.59 and 0.69) in every recipe, and frozen-TE and fusion-only reach the same watch IoU without the text rotation. The text-side compound-noun separation is real and unique to the all-trainable recipe; whether it’s load-bearing is the more honest question, and the answer is essentially no.

Visualising the text-encoder rotation

Because frozen-TE and fusion-only hold the text encoder byte-identical to base, the only thing to look at in text-embedding space is the contrast between base and the all-trainable fine-tune. The 37×37 pairwise-cosine heatmap below switches between those two states (and a Δ view). The full matrix is more informative than the five-pair tables above — you can see which subsets of the watch vocabulary the text encoder pulled apart, which it pulled together, and where the changes leave the unrelated majority of concept pairs alone.

FIG. 7.5 · 37 × 37 heatmap · toggle base / FT / Δ

Text-encoder pairwise cosine: base vs all-trainable

37×37 cosine heatmap of text-encoder embeddings. Toggle between base SAM3, the all-trainable fine-tune, and Δ (all-trainable − base). Hover any cell for the exact cosine value. The Δ view uses a diverging scale: red = the pair grew further apart under fine-tuning, green = the pair grew closer.

Caveats about pred_IoU itself

We made pred_IoU the primary measure because it is behaviour-anchored, but it has two specific failure modes the reader should know about. Co-occurrence-driven inflation: pred_IoU between concept A and B is high whenever the model’s masks for both prompts land on overlapping image regions, regardless of whether the model believes A = B. Two concepts that frequently appear in the same image region (e.g. peripheral metal surfaces, watch-face components) can score high pred_IoU even when the model otherwise tracks them as distinct categories. Conversely, two genuinely synonymous concepts whose masks land at slightly different positions on the watch (the hour hand / minute hand case) will score low pred_IoU.

Recipe-specific mask aggressiveness. Across our five fine-tune recipes, pred_IoU on a fixed pair like diamonds ↔ stones ranges from 0.33 in fusion-only to 0.98 in presence_only. That spread reflects how aggressively each recipe’s mask predictor produces large vs. small masks at a given confidence threshold, not how strongly each recipe “believes” the two concepts are the same. Cross-recipe pred_IoU comparisons are therefore sensitive to recipe-specific mask behaviour and should be read with that in mind. Within-recipe pred_IoU rankings (which pairs overlap more than others, holding the recipe fixed) are the cleaner comparison.

Neither pred_IoU nor fusion[4] cosines is a perfect concept-identity probe, and the right move is to read them triangulated: the synonym pairs that show high pred_IoU and high fusion[4] cosine and recipe-invariance are the most defensible “the model treats these as the same concept” cases (essentially bezel ↔ lunette and leather ↔ strap). The rest of the activation-space merges we reported earlier are real residual-stream effects but don’t imply the same thing at the behavioural level.

Concept geometry as a 3D structure

For a side-by-side view of all four similarity metrics — text-encoder cosine, predicted-mask IoU, the “invented confusion” delta, and a combined score — here’s a 3D UMAP that lets you compare. Concepts close in one view are not necessarily close in another.

FIG. 7.6 · independently configurable panels

Side-by-side 3D UMAP of concept geometry

Two independently configurable 3D UMAPs. Each panel has its own checkpoint and metric selector; the shared slider controls edge density. Edges are coloured blue → red where the strongest connection is red and the weakest blue. Drag any panel to rotate.
08

And every checkpoint failed catastrophically out of domain

By every measure in §1–§7, all three fine-tunes look healthy. The weight deltas are small and low-rank. Most of the fusion+decoder change fits in a 256-dim residual that base’s downstream can still decode. The text encoder didn’t need to move. Concept geometry reorganised coherently. The decoder co-adapted cleanly when allowed to, and stayed transferable when frozen.

And then we ran the same checkpoints on SA-Co/Gold — the SAM3 paper’s official 1.4 k-image open-vocabulary benchmark. The fraction of correctly refused negative prompts (asking the model to segment something the image does not contain) collapsed from base’s 95.8 % to about 2–3 % on the all-trainable and frozen-TE recipes — near-total failure on open-vocabulary refusal. Average cgF1 dropped from 55.7 to around 5. Every healthy-looking fine-tune we’d built failed in the same way.


The fusion-only recipe from §6 recovers only a small part of the way back. With the decoder, every head, and the entire vision pathway held byte-identical to base, the same evaluation gives negative-correctness 8.07 %: modestly better than the standard recipes (about 3× their score) but still about 88 points below base. Freezing the decoder clearly helped, since the parameters most associated with the broken behaviour can’t move, but the head’s inputs come from upstream fusion, and those did shift. Even with the head’s weights pinned, a residual stream that is far enough off-distribution produces false positives at almost the same rate.

That comparison — standard recipes at 2–3 %, fusion-only at 8 %, base at 95.8 % — turns out to put a sharp number on what is actually broken: roughly a twentieth of the collapse is the downstream head’s parameters moving, and roughly nineteen-twentieths is the upstream residual stream drifting off the input distribution the head expects. The cause is sharply localised, but not where the first forensic look at “which parameter moved most” pointed. How we actually localised the failure (after first chasing the most-moved parameter and finding it was a symptom rather than the cause), how a 162-second retrain recovers most of the loss — on prompts the model has never seen during the retrain — and what it implies for the broader practice of single-task fine-tuning is the topic of Post 2.
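The one-twentieth figure can be reproduced with a line of arithmetic, under the assumption that the collapse splits additively into a downstream part (what freezing the decoder and heads buys back) and an upstream part (what remains). `head_share` is a hypothetical helper name; the inputs are the neg-correct percentages quoted in this post.

```python
def head_share(base: float, standard: float, fusion_only: float) -> float:
    """Share of the neg-correct collapse recovered by freezing decoder and heads.

    Assumes an additive split: (fusion_only - standard) is attributed to the
    downstream parameters, the remainder to upstream residual-stream drift.
    """
    return (fusion_only - standard) / (base - standard)

base, fusion_only = 95.8, 8.07
for name, standard in [("all-trainable", 2.2), ("frozen-TE", 3.2)]:
    share = head_share(base, standard, fusion_only)
    print(f"{name}: downstream share = {share:.3f}, upstream share = {1 - share:.3f}")
```

Against either standard recipe the downstream share lands between 0.05 and 0.07, which is where "roughly a twentieth" comes from.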

09

LoRA on the fusion stack — sizing the first rank from the weight spectrum

The three findings above — that the task localises in one residual tensor at fusion[4], that the parameter update lives in a small subspace, that the fusion-only recipe matches the all-trainable recipe to within five IoU points using just 0.94 % of the model’s parameters — together suggest a question. If the update is a thin perturbation on a single residual tensor, do we even need full-rank gradients on the fusion stack’s 7.9 million parameters? Or can a much smaller adapter, parameterised as explicit low-rank, do the same job? And once we know how to size that adapter from the spectrum, can we push the same recipe to the whole model and beat the full fine-tune at a fraction of the parameters?

We’ll walk this in two arcs. First, fusion-stack LoRA at four rank budgets to ground the math primer and the saturation diagnostic. Then, a Marchenko-Pastur-rank readout from a partial fine-tune, applied per tensor across the whole SAM3 graph: 12.5 M trainable parameters (1.5 % of the model) and 0.9232 watch IoU, beating the ~660 M-parameter full fine-tune’s 0.9215.

We trained LoRA adapters on the same five fusion layers, with attention and FFN weights both LoRA-able, at four rank budgets (8, 16, 32, 48). Same dataset, same three epochs, same effective batch size, presence head frozen, vision and text encoders frozen. The smallest variant trains 307 000 parameters — 0.037 % of the model; the largest, 1.84 million.

A LoRA math primer (in three equations)

The mechanism that makes the rank knob meaningful, the saturation diagnostic, and the rsLoRA-style α correction all descend from three pieces of linear algebra.

(1) Rank of a matrix. The rank of an m × n matrix W is the number of linearly independent rows (= number of linearly independent columns), bounded by min(m, n). Equivalently, it is the number of non-zero singular values in W’s SVD, W = U Σ V^T. Rank-r matrices form a subset of m × n matrices of dimension r·(m + n − r), which is much smaller than the ambient m·n when r ≪ min(m, n). This is the parameter efficiency that makes low-rank adapters interesting.

(2) The Eckart-Young theorem. For any matrix W and any target rank r, the best rank-r approximation (in Frobenius or spectral norm) is W_r = U_r Σ_r V_r^T, the SVD truncated to its top r singular values. The Frobenius approximation error is ‖W − W_r‖_F² = Σ_{i>r} σ_i², the sum of the squared singular values you threw away. So if a matrix is “effectively low-rank” (most of its Frobenius norm sits in a few large singular values), a small r captures almost all of it. If the spectrum is flat, you need a large r.
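The error identity in (2) is easy to check numerically. The sketch below builds a hypothetical 64 × 48 matrix that is rank 8 plus small noise (sizes, noise level and seed are arbitrary choices for illustration) and compares the squared Frobenius error of the rank-8 truncation with the sum of the discarded squared singular values.

```python
import numpy as np

def truncated_svd(W: np.ndarray, r: int):
    """Return the rank-r SVD truncation of W and W's full singular spectrum."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return W_r, s

rng = np.random.default_rng(0)
m, n, r = 64, 48, 8
# Hypothetical "effectively low-rank" matrix: rank-8 signal plus small dense noise.
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
W += 0.01 * rng.standard_normal((m, n))

W_r, s = truncated_svd(W, r)
err = np.linalg.norm(W - W_r, "fro") ** 2    # squared Frobenius error of the truncation
tail = float(np.sum(s[r:] ** 2))             # sum of discarded squared singular values
rel_err = err / np.linalg.norm(W, "fro") ** 2

print(f"squared error {err:.6f}, discarded energy {tail:.6f}, relative {rel_err:.2e}")
```

The two quantities agree to floating-point precision, and the relative error is tiny because almost all of the energy sits in the first eight directions.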

(3) LoRA. Hu et al. (2021) observed that fine-tune updates often live in a small intrinsic dimension and proposed to parameterise the update as explicitly low-rank. For a frozen base weight W ∈ ℝ^(m×n), LoRA replaces W with

W_eff = W + (α / r) · B · A
where B ∈ ℝ^(m×r), A ∈ ℝ^(r×n), with r ≪ min(m, n); B initialised to 0, A initialised Kaiming-uniform, α a scaling constant.

B·A has mathematical rank at most r; it adds rank-r curvature on top of the frozen full-rank W. The number of trainable parameters drops from m·n to r·(m + n); for our F[0..4] LoRA-able tensors that’s a 60–250× reduction at r=16.

The scaling factor α / r needs explanation. With B = 0 at init, the LoRA contribution is exactly zero on the first forward pass, so the model output matches the base. As B grows during training, the contribution to the effective weight is (α/r) · B · A. The original paper’s convention is α = r, which gives scaling = 1.0 and is usually presented as letting you change the rank without retuning the learning rate. As we’ll see, it does not give comparable behaviour across ranks.
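To make the parameterisation concrete, here is a minimal NumPy sketch of a LoRA-wrapped linear map. It is illustrative only: the class name `LoRALinear`, the 64 × 48 shape, and the uniform bound 1/√n used for A are assumptions, not the adapter code used in these experiments.

```python
import numpy as np

class LoRALinear:
    """Frozen dense weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W: np.ndarray, r: int, alpha: float, rng: np.random.Generator):
        m, n = W.shape
        self.W = W                      # frozen base weight, shape (m, n)
        self.r, self.alpha = r, alpha
        bound = 1.0 / np.sqrt(n)        # assumed Kaiming-uniform-style bound
        self.A = rng.uniform(-bound, bound, size=(r, n))   # random init
        self.B = np.zeros((m, r))                          # zero init

    @property
    def scaling(self) -> float:
        return self.alpha / self.r

    def effective_weight(self) -> np.ndarray:
        return self.W + self.scaling * (self.B @ self.A)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (batch, n); the output has shape (batch, m)
        return x @ self.effective_weight().T

    def trainable_params(self) -> int:
        return self.A.size + self.B.size   # r * (m + n)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
layer = LoRALinear(W, r=8, alpha=8, rng=rng)
x = rng.standard_normal((4, 48))
print(np.allclose(layer(x), x @ W.T))            # B = 0, so the base output is unchanged
print(W.size, "->", layer.trainable_params())    # m*n versus r*(m+n)
```

Only A and B would receive gradients in training; W stays fixed, which is why the first forward pass reproduces the base model exactly.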

Choosing the first rank from the weight spectrum

Before sweeping ranks, we asked the data what rank we should need. The fusion-only fine-tune has a full-rank weight delta ΔW = W_FT − W_base for each LoRA-able tensor; this is the “ideal” update the LoRA must approximate. SVD this delta and count the statistically real singular directions: that gives a defensible upper bound on the rank a LoRA needs in order to capture the same update.

The key obstacle: ΔW is contaminated by gradient noise, floating-point drift, and the implicit-regularization residual of SGD. Not every non-zero singular value is signal. We need a threshold below which a singular value is indistinguishable from random noise. That threshold comes from the Marchenko-Pastur (1967) distribution — the limiting law of singular values of random Gaussian matrices.

For an m × n matrix whose entries are i.i.d. Gaussian with variance σ², the singular values concentrate inside the interval

[ σ·|√m − √n|,  σ_+ ]   where σ_+ = σ·(√m + √n)
asymptotically, as min(m,n)→∞.

Any singular value of ΔW above σ_+ is statistically distinguishable from what an i.i.d. Gaussian matrix of the same scale would have produced: it’s signal. Any singular value below σ_+ is indistinguishable from noise. We estimate σ from ΔW’s own Frobenius norm, σ̂ = ‖ΔW‖_F / √(m·n), which is the per-entry standard deviation if every entry of ΔW is treated as i.i.d. zero-mean Gaussian noise. Gavish & Donoho (2014) sharpen this into an optimal hard threshold; for our purposes the MP bulk edge gives the right order of magnitude.
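A minimal sketch of the count described above, run on a synthetic stand-in for ΔW (unit-variance noise plus three planted directions). The helper name `mp_signal_rank`, the matrix sizes and the planted strength are assumptions made for illustration; none of this touches SAM3 weights.

```python
import numpy as np

def mp_signal_rank(delta_w: np.ndarray):
    """Count singular values of delta_w above the Marchenko-Pastur bulk edge.

    sigma_hat is estimated from delta_w's own Frobenius norm, as in the text,
    so strong signal directions also raise the threshold.
    """
    m, n = delta_w.shape
    sigma_hat = np.linalg.norm(delta_w, "fro") / np.sqrt(m * n)
    sigma_plus = sigma_hat * (np.sqrt(m) + np.sqrt(n))
    s = np.linalg.svd(delta_w, compute_uv=False)
    return int(np.sum(s > sigma_plus)), float(sigma_plus)

# Synthetic stand-in for a weight delta: noise plus 3 planted rank-one directions.
rng = np.random.default_rng(0)
m, n, k = 200, 100, 3
U, _ = np.linalg.qr(rng.standard_normal((m, k)))   # orthonormal left directions
V, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal right directions
delta_w = 100.0 * U @ V.T + rng.standard_normal((m, n))

rank, edge = mp_signal_rank(delta_w)
print(f"signal rank = {rank}, bulk edge = {edge:.1f}")
```

Because the noise scale is read off the same matrix, the threshold is conservative when the signal carries much of the energy, so the count is best read as a floor-aware estimate rather than an exact rank.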

Applied to each of the 30 LoRA-able tensors in fusion04_only, the count of signal singular values (i.e. |{i : σ_i(ΔW) > σ_+}| per tensor) gives the table below. That count is the per-tensor “needed rank” if a LoRA wanted to be able to represent every above-noise direction the full-rank update used.

Table 9.1 — Signal rank per tensor group (fusion04_only)
| Tensor group | signal rank (median) | min–max |
| --- | --- | --- |
| Self-attention out_proj | 18 | 15–21 |
| Cross-attention in_proj | 27 | 25–31 |
| FFN linear1 / linear2 | 38 | 32–40 |
| All 30 tensors (overall) | 27 | 15–40 |

Reading: a LoRA budget below the per-tensor minimum (~15) would cut into signal; a budget at or above the maximum (~40) should comfortably cover everything the full-rank update can do. We sized the sweep to span that range: r=8 well below the per-tensor minimum, r=16 at the lower bound, r=32 near the median, r=48 above the maximum.

Results

Table 9.2 — Watch IoU vs. trainable params, seven recipes
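| Recipe | Trainable | Watch IoU | LVIS mIoU | cgF1 | pmF1 | neg-correct |
| --- | --- | --- | --- | --- | --- | --- |
| base SAM3 | 0 | 0.23 | 0.56 | 55.7 | 67.1 | 95.8 % |
| all-trainable | 840 M | 0.92 | 0.51 | 4.9 | 39.0 | 2.2 % |
| frozen-TE | ~660 M | 0.92 | 0.54 | 6.7 | 40.8 | 3.2 % |
| fusion-only F[0..4] | 7.9 M | 0.87 | 0.54 | 6.8 | 47.5 | 8.1 % |
| LoRA F[0..4] r=8 | 307 K | 0.76 | 0.55 | 9.3 | 56.2 | 8.5 % |
| LoRA F[0..4] r=16 | 614 K | 0.82 | 0.55 | 8.8 | 55.0 | 7.9 % |
| LoRA F[0..4] r=32 | 1.23 M | 0.84 | 0.54 | 9.9 | 54.1 | 11.8 % |
| LoRA F[0..4] r=48 | 1.84 M | 0.87 | 0.53 | 9.3 | 50.9 | 11.8 % |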

SA-Co cgF1 is the SAM3 paper’s headline open-vocabulary metric — F1 per concept group across the 2 000-pair Gold subsample, penalising both false-positive masks on absent concepts and missed positives. pmF1 is positive-image-only F1, which isolates the over-firing behaviour inside images that do contain the prompted concept.

At r=48 (1.84 M trainables, 0.22 % of the model) LoRA matches fusion-only’s in-domain watch IoU within rounding (0.87 vs 0.87) using 4.3× fewer parameters than fusion-only itself. It is still 5 IoU points below the best full-recipe number (0.92) — the gap consistent with vision-encoder co-adaptation that fusion-only scope cannot replicate by construction.

Loss curves — the per-sample picture

Comparing training losses across recipes that used different per-device batch sizes requires a small correction. The SAM3 trainer multiplies its reported loss by √(per-GPU batch size), a code path that’s on by default and scales the reported number, not the optimiser’s objective. The full-recipe runs used per-GPU batch 1 with 8-step gradient accumulation; the LoRA runs used per-GPU batch 16 with no accumulation. Effective gradient batch is similar (16–32 across recipes), but the displayed loss is multiplied by 1.0 in the first case and by 4.0 in the second. Dividing each curve by √bs recovers per-sample loss and lets the seven runs be compared directly:

FIG. 9.3 · log scale · √bs-corrected

Per-sample loss curves, seven recipes

Hover any curve for exact (step, loss). Log y-axis; dashed verticals mark epoch boundaries (one epoch = 2 656 steps). The y-axis is train_all_loss ÷ √(per-GPU batch size), undoing the trainer’s scale_by_find_batch_size multiplier so per-sample losses are comparable across recipes.
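The correction in the caption amounts to a single division. A small sketch, in which the function name and the raw logged value 17.04 are made up for illustration; only the √bs rule and the two batch sizes come from the text.

```python
import math

def per_sample_loss(reported_loss: float, per_gpu_batch_size: int) -> float:
    """Undo a sqrt(batch size) multiplier applied to a logged loss value."""
    return reported_loss / math.sqrt(per_gpu_batch_size)

# Hypothetical logged values for two runs.
full_recipe = per_sample_loss(3.21, per_gpu_batch_size=1)    # multiplier 1.0
lora_run = per_sample_loss(17.04, per_gpu_batch_size=16)     # multiplier 4.0
print(full_recipe, lora_run)
```

Without the division, the second run would appear four times worse than it is on a per-sample basis.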

After the correction the picture is much cleaner. Four readings:

  • The all-trainable and frozen-TE curves are nearly indistinguishable. Disabling text-encoder gradients doesn’t change the training loss in any visible way — consistent with the <0.6 % weight motion the text encoder showed in §1 even when it was allowed to move. Both end near per-sample loss 3.21.
  • Fusion-only plateaus about one unit above the full recipes. Training converges to per-sample loss 4.26 vs 3.21: the decoder, presence head and vision pathway are all frozen, so a residual portion of the matched auxiliary losses (cross-entropy + dice + box across five intermediate decoder layers) can’t be reduced. The gap is small but real and consistent with the 5-IoU-point in-domain gap versus the full recipes.
  • LoRA plateaus ~0.1–0.8 per-sample units above fusion-only. On a per-sample basis LoRA r=48 sits at 4.37 (only +0.11 above fusion-only’s 4.26); r=8 sits at 5.02 (+0.76). The gap is real but modest — well within the headroom the rank-saturation diagnostic later reveals.
  • Rank corresponds cleanly to per-sample loss. After the correction the four LoRA curves form a clean monotone stack — r=8 highest, r=48 lowest, with the vertical gaps shrinking at each doubling. Diminishing returns in the loss show up before the IoU numbers do. (Caveat to keep in mind: this monotone-by-rank picture is at the standard α = r scaling — later sections show that the same ranks at higher α produce a very different stack, which is the real lever here.)

Same circuit, different intensity

All four LoRA variants reproduce the fusion-only mechanistic fingerprint. The activation-patching probe from §2 (swap the FT’s fusion[4] tensor into base, measure IoU recovery) gives 93.9 % for r=48, climbing monotonically from 87.2 % at r=8. Patching at fusion[0] gives 0–2 % for every LoRA rank, just like fusion-only — the early residual is unchanged. The per-head cross-attention-text ablation from §7 returns the same top-5 (layer, head) set across ranks. The residual CKA at fusion[0] (0.93–0.97 for LoRA vs 0.63 for all-trainable) places every LoRA variant in fusion-only’s neighbourhood, not in the full-fine-tune neighbourhood. Rank changes how cleanly the same circuit reads out; it does not change which circuit.

Out of domain — LoRA forgets a two-axis problem differently

The catastrophic-forgetting picture from §8 returns here with an interesting twist. The neg_correct rate (correctly predicting an empty mask when the prompted concept is absent) and the LVIS-val positive-only mIoU (held-out segmentation when the concept is there) move in opposite directions with LoRA rank:

Table 9.4 — Two-axis OOD trade-off across LoRA ranks
| Rank | Watch IoU (in-domain) | LVIS-val mIoU (pos) | SA-Co neg-correct |
| --- | --- | --- | --- |
| 8 | 0.76 | 0.554 | 8.5 % |
| 16 | 0.82 | 0.550 | 7.9 % (worst) |
| 32 | 0.84 | 0.540 | 11.8 % |
| 48 | 0.87 | 0.532 | 11.8 % |

Lower rank does protect base ability on the positive side (LVIS-val drops from 0.564 base → 0.554 at r=8, only 1.7 %). But on the negative side, lower rank is worse, not better — r=16 bottoms out at 7.9 % refusal correctness, recovering to 11.8 % only at r=32 and above.

The mechanism falls out of one number we can extract from every checkpoint: the ratio of fusion[5]’s output norm in the fine-tune to its output norm in base. LoRA cannot retrain the presence head (dot_prod_scoring is frozen). The only way to push the head’s logits up for watch concepts is to inflate the fusion[5] residual norm so the unchanged scoring weights produce bigger scores. Low rank means a coarse direction, so the residual has to be amplified harder for the same in-domain logit — and the amplification accidentally fires the head on absent concepts. Higher rank shapes the fusion direction more precisely; less amplification is needed; fewer false positives. The norm-inflation ratio at fusion[5] traces exactly the same curve as neg_correct in reverse: 1.135 at r=16 (worst), 1.110 at r=32, 1.093 at r=48 (matches fusion-only). Different rank, different direction quality, different hallucination rate.

The contrast with the full-rank recipes is telling. The all-trainable and frozen-TE fine-tunes hardly inflate fusion[5]’s norm at all (~1.01); they instead retrain the presence head itself to fire more aggressively on watch concepts. Different parameters move, same downstream effect, except worse: their neg_correct sits at 2–3 %, roughly three to four times lower than the LoRA variants’ 8–12 %. The frozen head ends up being a useful regulariser by accident: parameter-efficient training cannot damage the negative-prompt rejection as comprehensively as a full fine-tune can.

10

Rank saturation and the α-boost

Is the rank saturated? A single-checkpoint diagnostic

Curves flatten for many reasons — rank running out is only one. The same SVD-and-Marchenko-Pastur machinery that sized our initial rank sweep also tells us, from a single trained LoRA checkpoint, whether the rank budget was actually the binding constraint.

Compute the SVD of the effective LoRA update Δ = (α/r)·B·A. By construction the matrix product B·A has mathematical rank at most r, so its SVD has at most r non-zero singular values σ_1 ≥ σ_2 ≥ … ≥ σ_r. Three statistics measure whether those r directions are all carrying signal or whether some of them are at the random-matrix noise floor:

σ_r / σ_1: head-to-tail ratio. If close to 1, every direction has comparable energy and r is fully utilised. If close to 0, the last direction carries almost nothing and the model is using fewer than r dimensions.

|{i : σ_i(Δ) > σ_+}| / r: fraction of the rank budget statistically above the Marchenko-Pastur floor (with σ_+ computed the same way as in the initial rank-choice analysis). 100 % means every direction is signal; 20 % means 4/5 of the rank is wasted on directions indistinguishable from random noise.

r_eff / r: entropy-based effective rank (Roy & Vetterli 2007), r_eff = exp(H(p)) with p_i = σ_i² / Σ_j σ_j². Equal energies → r_eff = r; a spectrum dominated by one direction → r_eff → 1.

Three thresholds together classify the run:

Table 9.5 — Rank-saturation diagnostic thresholds
| Diagnostic | What it measures | Rank-limited if |
| --- | --- | --- |
| σ_r / σ_1 | head-to-tail ratio of the r singular values of Δ | > 0.1 |
| count{σ_i > σ_+} / r | fraction of the rank budget above random-matrix noise | > 70 % |
| r_eff / r | entropy-based effective rank as a fraction of budget | > 70 % |

The Marchenko-Pastur floor is the bulk-edge of the singular-value distribution of a Gaussian matrix with the same Frobenius norm as Δ — anything below that line is statistically indistinguishable from noise. When all three diagnostics are above their thresholds, the flat curve is rank; when all three are below, it is data, optimiser, or scope.
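The three statistics can be read off a single (B, A) pair. The sketch below is one possible implementation under stated assumptions: the function names are hypothetical, B is assumed non-zero (a trained adapter), r is assumed to be at most min(m, n), and the two toy adapters are synthetic spectra built to sit at the extremes, not SAM3 checkpoints.

```python
import numpy as np

def rank_saturation(B: np.ndarray, A: np.ndarray, alpha: float) -> dict:
    """Three rank-utilisation statistics for a trained LoRA pair (B non-zero)."""
    m, r = B.shape
    n = A.shape[1]
    delta = (alpha / r) * (B @ A)
    s = np.linalg.svd(delta, compute_uv=False)[:r]      # at most r non-zero values
    sigma_hat = np.linalg.norm(delta, "fro") / np.sqrt(m * n)
    sigma_plus = sigma_hat * (np.sqrt(m) + np.sqrt(n))  # MP bulk edge
    p = s**2 / np.sum(s**2)
    p = p[p > 0]                                        # avoid log(0)
    r_eff = float(np.exp(-np.sum(p * np.log(p))))       # exp of spectral entropy
    return {
        "tail_ratio": float(s[-1] / s[0]),
        "frac_above_mp": float(np.mean(s > sigma_plus)),
        "r_eff_frac": r_eff / r,
    }

def rank_limited(stats: dict) -> bool:
    """Thresholds from Table 9.5: all three must be exceeded."""
    return (stats["tail_ratio"] > 0.1
            and stats["frac_above_mp"] > 0.70
            and stats["r_eff_frac"] > 0.70)

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4
Qb, _ = np.linalg.qr(rng.standard_normal((m, r)))   # orthonormal columns
Qa, _ = np.linalg.qr(rng.standard_normal((n, r)))
flat = rank_saturation(Qb, Qa.T, alpha=r)           # four equal singular values
peaked = rank_saturation(Qb * np.array([1.0, 1e-3, 1e-3, 1e-3]), Qa.T, alpha=r)
print(flat, rank_limited(flat))
print(peaked, rank_limited(peaked))
```

The flat spectrum passes all three thresholds and the peaked one fails all three, which are the two unambiguous cases the paragraph above describes; real checkpoints usually land in between.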

Applied to our four LoRA checkpoints, averaged over the 30 LoRA-able tensors:

Table 9.6 — Rank-saturation diagnostics per LoRA rank
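| Rank | σ_r / σ_1 | count{σ_i > σ_+} / r | r_eff / r | Verdict |
| --- | --- | --- | --- | --- |
| 8 | 0.065 | 51 % | 30 % | borderline rank-limited |
| 16 | 0.039 | 36 % | 22 % | borderline |
| 32 | 0.023 | 25 % | 17 % | not rank-limited |
| 48 | 0.016 | 19 % | 15 % | not rank-limited |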

Reading: r=8 and r=16 are still rank-constrained — doubling rank should give a real lift, and does (+5.6 and +1.9 IoU points respectively at the next doubling). r=32 and r=48 have huge spare capacity (37 of r=48’s 48 singular directions are at noise). The +3 IoU between them is not bought by filling new spectral directions.

The α-scaling math — why rank and scaling are coupled

The convention α = r from the original LoRA paper gives scaling = 1.0 at every rank and was sold as letting you change r without retuning the LR. But that is only true at fixed rank. Across ranks, something subtler happens. With the standard LoRA init (B = 0, A ~ Kaiming-uniform with row-rms ~1/√n), the magnitude of the learnable matrix B · A grows during training in a way that depends on r. Concretely, when LoRA is trained to convergence with a fixed learning rate, the per-element magnitude of B (the only thing that grows from zero) scales as roughly 1/√r: more rank means the gradient signal is spread across more parameters, and each individual parameter receives less update per step.

The effective magnitude of the update (α/r) · B · A therefore scales as

‖(α/r) · B · A‖ ∝ α / (r · √r) = α / r^(3/2)
per-step contribution under standard scaling at fixed LR.

With α = r, the per-step contribution shrinks as 1/√r: higher rank is implicitly annealed even though the loss target is the same. This is what rsLoRA (Kalajdzievski 2023) formalises. The rsLoRA prescription is to replace the scaling factor:

α / r  ⟶  α / √r
Equivalently, in the existing α/r framework: choose α = c · √r for some constant c.

Under α = c·√r, the scaling factor becomes α/r = c/√r, and combined with ‖B·A‖ ∝ 1/√r the per-step update magnitude becomes rank-stable rather than collapsing. The constant c sets the absolute LR scale; we anchored it to our existing r=32 single-checkpoint datapoint by picking c ≈ 7.95 (i.e. α = 45 at r=32, scaling factor 1.41×). For the other ranks this prescribes:

Table 9.7 — rsLoRA α prescription, c ≈ 7.95
| Rank | α (rsLoRA, c=7.95) | Scaling factor α/r | vs default α=r |
| --- | --- | --- | --- |
| 8 | 22 (≈ 7.95·√8) | 2.75× | +175 % |
| 16 | 32 (≈ 7.95·√16) | 2.00× | +100 % |
| 32 | 45 (≈ 7.95·√32) | 1.41× | +41 % |
| 48 | 55 (≈ 7.95·√48) | 1.15× | +15 % |
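The prescription is a two-line calculation. A small sketch that regenerates the α and scaling columns, assuming α is rounded to the nearest integer (the rounding rule and the helper name `rslora_alpha` are assumptions; the constant c = 7.95 is the one quoted above).

```python
import math

def rslora_alpha(r: int, c: float = 7.95):
    """Return (alpha, alpha / r) for alpha = c * sqrt(r), with alpha rounded to an integer."""
    alpha = round(c * math.sqrt(r))
    return alpha, alpha / r

for r in (8, 16, 32, 48):
    alpha, scaling = rslora_alpha(r)
    print(f"r={r:>2}  alpha={alpha:>2}  alpha/r={scaling:.2f}  "
          f"vs alpha=r: {100 * (scaling - 1):+.0f} %")
```

Changing `c` rescales every row by the same factor, which is why the later sweep over α at fixed rank is really a sweep over c.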

Two predictions follow before any new training:

  • The boost effect should be biggest at small rank and shrink with rank. The boost factor under rsLoRA decreases as 1/√r relative to the default α/r = 1; if α is genuinely the lever, the IoU gain from boost should track that shape.
  • If the +0.032 IoU gap from r=32 to r=48 is mostly scaling (not capacity), the boost at r=32 should close most of that gap.

The α-boost ablation across ranks

We retrained the standard r=32 for a same-time seed-noise baseline, plus four rsLoRA-boosted runs (one per rank), keeping everything else identical.

FIG. 9.8 · dropdown selects rank

Standard α=r vs rsLoRA α=c·√r, per rank

Dropdown selects rank; chart shows the standard α = r run (grey) vs the rsLoRA-boosted α = c · √r run (amber) for that rank. Hover any line for (step, loss).

The per-sample final losses (smoothed, last training step):

Table 9.9 — Per-sample loss and IoU, std vs boost, per rank
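| Rank | std loss (α=r) | boost loss (rsLoRA α) | Δ loss | std IoU | boost IoU | Δ IoU |
| --- | --- | --- | --- | --- | --- | --- |
| 8 | 5.09 | 4.60 | −0.49 | 0.762 | 0.842 | +0.080 |
| 16 | 4.60 | 4.31 | −0.29 | 0.818 | 0.856 | +0.038 |
| 32 | 4.34 | 4.19 | −0.15 | 0.843 | 0.861 | +0.018 |
| 48 | 4.20 | 4.12 | −0.08 | 0.869 | 0.867 | −0.002 |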

Both columns track the same monotone pattern in the direction the rsLoRA math predicts: the smaller-rank end gets the biggest boost. r=8 improves by 0.49 in per-sample loss and by 0.080 in IoU; r=48 is at the floor (−0.08 loss, −0.002 IoU). r=32 with the boost (0.861) closes most of the gap to standard r=48 (0.869) at the same rank-32 capacity. r=16 with α=32 (0.856) even passes standard r=32 (0.843) with half the parameters. The shape matches.

What the shape doesn’t pin down is the constant. rsLoRA’s literature default c ≈ 8 is the value at which the boost starts to matter; whether that value is also the optimum, or whether IoU keeps climbing past it, is a separate question. The next three subsections chase it — first asking whether the boosted LoRAs are even using all their rank, then sweeping α past rsLoRA’s prescription until something breaks, then explaining why the peak ends up so far above the literature default.

Testing the prediction — were our LoRAs “full”?

The boost ablation showed rsLoRA-aligned α (c ≈ 8) improves IoU at low rank. But did the boosted LoRA actually use the rank we gave it? Recap of the rank-saturation diagnostic from earlier in this section: SVD the trained B·A product, count singular values above the MP threshold. For the fusion-scope r=32 LoRA the per-tensor signal-rank ratio averaged ~17/32 at convergence — about half the budget actually used. Plenty of capacity left, but the IoU still trailed the full FT by 5 points. So rank was not the bottleneck: the LoRA had room it wasn’t using. Something else was throttling the per-step update.

The reason ties back to the math primer at the top of the section. In standard LoRA, the per-step update to the effective weight is η · (α/r) · ∇L. With α = r (the original Hu et al. default), α/r = 1 and the LoRA branch sees the same effective learning rate as the base parameters. But the base model’s LR is tuned for full fine-tuning — calibrated for gradients on a 660 M-param graph, not on a 30 M-param LoRA branch that starts at zero and needs to grow. The mismatch is exactly the factor by which α should be boosted. The next subsection is the sweep that pins down how much.

But how high should α actually go?

The rsLoRA prescription α = c · √r works for a fixed c; c ≈ 7.8 (so α=22 at r=8) was the literature default we started from. Our boost runs above already used a larger c implicitly — α=64 at r=8 corresponds to c ≈ 22.6, almost three times the rsLoRA default. The IoU kept climbing. So how far up does the curve go before something breaks?

We pushed r=8 LoRA past the rsLoRA prescription in a clean ladder: α ∈ { 16, 22, 40, 64, 96, 128, 160 }. All seven configurations train the same five fusion layers, identical in every other respect.

FIG. 9.15 · α from 16 (under-driven) to 160 (over-driven), 7 runs

r=8 LoRA — training loss across the full α ladder

All seven r=8 LoRA runs at increasing α. Hover any curve for (step, smoothed per-sample loss); rightmost legend entry includes the final watch-IoU number. Lower loss consistently corresponds to higher IoU until α=160, where loss and IoU both reverse — the saturation point.
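Table 9.15 · r=8 LoRA, full α-ladder

| α | scaling α/r | rsLoRA c = α/√r | end-loss | watch IoU | ΔIoU vs α=22 |
|---|---|---|---|---|---|
| 16 | 2.0× | 5.7 | 5.25 | 0.816 | +0.054 |
| 22 (~rsLoRA) | 2.75× | 7.8 | 4.83 | 0.762 | 0 |
| 40 | 5.0× | 14.1 | 4.75 | | |
| 64 | 8.0× | 22.6 | 4.70 | 0.836 | +0.074 |
| 96 | 12.0× | 33.9 | 4.15 | 0.876 | +0.114 |
| 128 | 16.0× | 45.3 | 4.30 | 0.885 | +0.123 |
| 160 | 20.0× | 56.6 | 4.75 | 0.869 | +0.107 |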

The peak is at α = 128 (boost factor 16×, equivalent to rsLoRA c ≈ 45) — about six times the rsLoRA default and giving +0.12 IoU over the rsLoRA-aligned baseline. At α=160 the loss climbs back up and watch IoU drops 0.016. The optimum is much higher than any rsLoRA-style prescription would suggest, but there is one — and you find it by sweeping. rsLoRA pointed the right direction; the constant it prescribes is roughly six times too small for SAM3.

Why is the optimum so high? (Probably because the LR was too low.)

The per-step gradient update to the effective weight in LoRA is η · (α/r) · ∇_B L · A (and symmetrically for A). The product η · (α/r) is the only thing that matters for how far the weight moves per step. Whether you double η or double α/r, the update size is the same.

SAM3’s training recipe uses a transformer base LR of 8 × 10⁻⁴ multiplied by lr_scale = 0.1, peaking at 8 × 10⁻⁵, a deliberately conservative setting tuned for full fine-tuning. When you wrap a LoRA adapter on top, the optimiser sees a tiny new parameter group whose effective influence on W is scaled by α/r. At the default α = r, the LoRA branch’s effective LR is identical to the base: calibrated for full-FT gradients, not for adapter gradients that start at zero and need to grow far. The α-boost is just an offset that gets the adapter’s effective LR back into a useful range. At α = 128, r = 8 the per-step weight-update magnitude is 16× the standard LoRA setting, equivalent in shape to having used a 1.28 × 10⁻³ LR with α=r=8, which is also plausibly “the right number” for SAM3.
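
The arithmetic behind that last figure can be written out explicitly. The sketch below takes the first-order picture at face value (effective adapter LR = peak LR × α/r); `effective_adapter_lr` is a hypothetical helper, and the constants are the ones quoted in the paragraph above.

```python
BASE_LR = 8e-4    # transformer base LR quoted in the text
LR_SCALE = 0.1    # lr_scale quoted in the text
RANK = 8

def effective_adapter_lr(alpha: float, rank: int, peak_lr: float) -> float:
    """Peak LR multiplied by the LoRA scaling alpha / r (first-order picture only)."""
    return peak_lr * alpha / rank

peak_lr = BASE_LR * LR_SCALE   # 8e-5
for alpha in (8, 64, 96, 128):
    eff = effective_adapter_lr(alpha, RANK, peak_lr)
    print(f"alpha={alpha:>3}  alpha/r={alpha / RANK:>4.0f}x  effective LR={eff:.2e}")
```

This is bookkeeping only; it does not say anything about how Adam treats the two knobs differently.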

We checked the substitution at three boost factors on r=8 to see how well the equivalence holds in practice:

Table 9.21 · α-vs-LR exchange. r=8 LoRA, standard α=r=8 with LR scaled k× vs α=k·r=8k with default LR.

| boost k | α-boost (α=8k, LR=default) | LR-boost (α=8, LR=k×) | Δ (LR − α) |
|---|---|---|---|
| baseline (k=1) | 0.762 | 0.762 | 0 |
| 8× | 0.836 | 0.8696 | +0.034 |
| 12× | 0.876 | 0.8538 | −0.022 |
| 16× | 0.885 | 0.8625 | −0.022 |

At 8× the LR-boost actually wins. At 12× and 16× the α-boost wins. The substitution is approximately right but not exact, and the relationship isn’t even monotonic in the LR direction — lr=12× (0.8538) sits below both lr=8× and lr=16×.

Two things break the symmetry slightly. First, the boost only affects the LoRA branch, not the base parameters, so it lets us push the adapter while everything else stays in its original LR regime. Second, α rescales the joint B · A product, whereas the LR rescales the steps on B and A individually, and Adam’s per-parameter adaptive scaling on each of B and A still sees the raw gradient. Adam normalises each step by the second-moment estimate v_i, so the extra factor of α/r that the boost puts into the gradients on B and A is divided back out, while a larger LR passes straight through to the step size. On the first step from B = 0 the two paths therefore move the effective weight identically under Adam; after that they drift apart, because the LR-boost grows B and A themselves k× faster while the α-boost leaves them at their usual scale. Under plain SGD there is no such normalisation: α/r enters the update twice (once in the forward scaling, once in the gradient), so the exchange is not one-to-one even on the first step.
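
A toy numerical check makes the first-step behaviour concrete. This is a sketch under loud assumptions, not the SAM3 training code: a single linear layer, a random matrix `G` standing in for the upstream gradient ∂L/∂W_eff, B initialised to zero, and an idealised first Adam step in which ε is ignored (so the step is −lr · sign(g)). The function name and shapes are hypothetical.

```python
import numpy as np

def first_step_delta_w(scale, lr, G, A, optimizer):
    """Change in W_eff = W + scale * B @ A after one step on B, starting from B = 0."""
    grad_B = scale * (G @ A.T)              # chain rule through scale * B @ A
    if optimizer == "sgd":
        delta_B = -lr * grad_B
    elif optimizer == "adam":
        # Idealised first Adam step: m_hat / sqrt(v_hat) = g / |g|, eps ignored.
        delta_B = -lr * np.sign(grad_B)
    else:
        raise ValueError(f"unknown optimizer: {optimizer}")
    # The gradient on A is scale * B.T @ G = 0 at B = 0, so A does not move here.
    return scale * (delta_B @ A)

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 10))   # stand-in for dL/dW_eff (toy shapes)
A = rng.normal(size=(4, 10))   # LoRA down-projection, rank 4
k, lr = 16.0, 8e-5

for opt in ("adam", "sgd"):
    via_alpha = first_step_delta_w(k, lr, G, A, opt)       # alpha/r = k, default LR
    via_lr = first_step_delta_w(1.0, k * lr, G, A, opt)    # alpha/r = 1, LR x k
    ratio = np.linalg.norm(via_alpha) / np.linalg.norm(via_lr)
    print(f"{opt}: |dW| via alpha / |dW| via LR = {ratio:.2f}")
```

In this toy setting the Adam ratio is 1 and the SGD ratio is k. The check covers the first step only; it says nothing about later steps, where B and A have grown by different amounts under the two boosts.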

Practically: prefer α-boost over LR-boost when pushing past rsLoRA. LR works up to a moderate boost (~8×) but starts fighting Adam at higher values; α stays well-behaved much further. The boost benefit also caps sub-linearly with α because eventually the per-parameter Adam scale and the effective LR mismatch start to fight each other.

Statistical significance of the boost effect

The 180-task eval set lets us run paired stats per rank (same task ID, std vs boost on both). Three things to check: the direction of the effect (sign test), the magnitude (paired t / bootstrap CI), and the effect size (Cohen’s d on paired diffs).

Table 9.10 · Paired-stats per rank, boost vs std

| Pair (boost vs std) | α/r | mean Δ IoU | wins / losses | sign-test p | paired-t p | Cohen’s d |
|---|---|---|---|---|---|---|
| r=16 α=32 vs r=16 | 2.00× | +0.038 | 118 / 61 | 2.5 × 10⁻⁵ | 2.5 × 10⁻⁴ | 0.28 |
| r=32 α=45 vs r=32 | 1.41× | +0.018 | 113 / 66 | 5 × 10⁻⁴ | 0.026 | 0.17 |
| r=48 α=55 vs r=48 | 1.15× | −0.003 | 113 / 65 | 4 × 10⁻⁴ | n.s. | −0.03 |
| r=8 α=128 vs r=8 | 16.0× | +0.123 | 147 / 33 | 1.1 × 10⁻¹⁸ | 3.5 × 10⁻¹⁰ | 0.50 |

Reading the rows:

  • r=16 is unambiguously significant in both direction AND magnitude. Paired-t p = 2.5 × 10⁻⁴, Cohen’s d = 0.28 (above the conventional 0.20 “small” threshold), 118 wins vs 61 losses on 179 paired tasks. The α-boost effect at 2.0× is real, not seed noise.
  • r=32 is significant in direction but marginal in magnitude. Sign test p = 5 × 10⁻⁴ (113 wins vs 66 losses) but paired-t p = 0.026 (just under 0.05), Cohen’s d = 0.17 (below “small”). The +0.018 mean is dragged up by a noisy tail of large wins partly cancelled by large losses; median Δ ≈ 0.
  • r=48 is the interesting null. Direction still highly significant (113 vs 65, sign-test p = 4 × 10⁻⁴ — α=55 helps more tasks than it hurts) but the mean is negative (−0.003) because the wins are smaller and the few losses are concentrated in fine-detail categories. The 1.15× boost factor is too small to move the needle on aggregate IoU.
  • r=8 α=128 is the strongest signal in the table. Mean ΔIoU = +0.123 over 180 paired tasks, 147 wins vs 33 losses, paired-t p = 3.5 × 10⁻¹⁰, Cohen’s d = 0.50 (medium effect size, bigger than any other row). At low rank with very high α the boost is unambiguously real and large; this is the saturation-sweep result viewed through the same paired-statistics lens, and it matches the rsLoRA-prescribed direction at a much larger boost factor.
  • Three independent rank points + the r=8 paired comparison all lining up with the rsLoRA-predicted monotone shape (effect size shrinks as boost factor shrinks) is much stronger evidence than any single comparison alone. With 4 paired data points and matched-direction effects, the probability of seeing this pattern under “α-scaling does nothing” is well below 10⁻⁴.
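
The three checks in Table 9.10 can be sketched with the Python standard library. The helper names are hypothetical, and two conventions are assumptions rather than facts from the post: ties are dropped before the sign test, and the sign test is two-sided against a fair-coin null. The per-task ΔIoU values are not reproduced here, so `paired_effect` is shown on a placeholder list.

```python
import math
import statistics

def sign_test_p(wins: int, losses: int) -> float:
    """Exact two-sided sign test (ties dropped) against a fair-coin null."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2**n)

def paired_effect(diffs):
    """Return (mean, Cohen's d, t statistic) for a list of paired differences."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)      # sample SD (ddof = 1)
    d = mean / sd                     # Cohen's d on paired diffs
    return mean, d, d * math.sqrt(n)  # paired t = d * sqrt(n)

# Win/loss counts from Table 9.10
for label, wins, losses in [("r=16", 118, 61), ("r=32", 113, 66),
                            ("r=48", 113, 65), ("r=8", 147, 33)]:
    print(f"{label}: sign-test p = {sign_test_p(wins, losses):.1e}")

# Placeholder diffs, only to show the call shape
print(paired_effect([0.1, 0.2, 0.3]))
```

The identity t = d · √n is a quick consistency check between the paired-t and Cohen’s d columns of any such table.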

The catastrophic-forgetting trade-off

The α boost improves in-domain IoU at r=8, r=16 and r=32, and is flat at r=48. But the cgF1 / pmF1 / neg_correct numbers tell the catastrophic-forgetting half of the story, and there the sign reverses:

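Table 9.11 · In-domain vs OOD across boost ranks

| Recipe | cgF1 | pmF1 | neg_correct % | watch IoU |
|---|---|---|---|---|
| base SAM3 | 55.72 | 67.08 | 95.77 | 0.23 |
| r=16 std | 8.84 | 54.97 | 7.92 | 0.818 |
| r=16 α=32 | 8.05 | 51.51 | 7.99 | 0.856 |
| r=32 std | 9.16 | 52.54 | 10.53 | 0.843 |
| r=32 α=45 | 9.67 | 52.88 | 11.44 | 0.861 |
| r=48 std | 9.29 | 50.88 | 11.80 | 0.869 |
| r=48 α=55 | 8.50 | 51.19 | 9.43 | 0.867 |
| r=8 α=22 | 8.85 | 52.96 | 9.53 | 0.842 |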

On average, where the boost helps in-domain IoU it slightly hurts the OOD metrics (cgF1 ↓ ~0.5, pmF1 ↓ ~1.5, neg_correct ↓ 1–2), though not uniformly: in Table 9.11 the r=32 pair ticks up slightly on all three. The explanation is mechanical: the boost increases the per-step update magnitude, which means the same loss target is reached by a larger weight move, pushing the model further from base and amplifying the inflation of the fusion[5] residual that drives the frozen presence head (see earlier). α boost trades a small slice of OOD robustness for a moderate in-domain IoU gain.

Did the math hold? — a scorecard

Table 9.14 · Mathematical predictions vs. empirical results

| Mathematical prediction | Empirical result | Verdict |
|---|---|---|
| MP signal-rank ceiling at ~40 ⇒ r=48 is enough capacity | r=48 matches fusion04_only IoU (0.869 vs 0.870) | confirmed |
| r=32 / r=48 are not rank-saturated under standard scaling | α boost at fixed rank gives substantial IoU gains; r=32→r=48 collapses with α adjustment | confirmed |
| rsLoRA: boost effect ∝ 1/√r relative to default ⇒ small-rank gets bigger gain | Δ IoU: 0.080 (r=8) > 0.038 (r=16) > 0.018 (r=32) > 0 (r=48). Monotone. | confirmed |
| rsLoRA prescribed constant c ≈ 8 is the right α magnitude | r=8 α-ladder peaks at α=128 (c ≈ 45), +0.12 IoU above α=22 | shape right; constant 6× too low |
| Direction of α-boost effect at r=48 (sign test still 113/65) | Sign-test p = 4 × 10⁻⁴ even though mean Δ is 0 | direction confirmed; magnitude null |
| α boost preserves the same circuit (fusion[4] localisation, same heads) | Per-head ablation top-K identical pre / post boost; patching curves overlay | confirmed |
| α boost incurs small OOD penalty | cgF1 ↓ ~0.5, pmF1 ↓ ~1.5, neg_correct ↓ ~1–2 pp at every gaining rank | confirmed |

Score so far: 6 of 7 in direction; the rsLoRA-constant row is the interesting one. rsLoRA’s shape prediction (smaller-rank gets the bigger boost, monotone across ranks) holds. Its prescribed constant does not: c ≈ 8 sits a factor of six below the IoU peak. Treat the literature default as a starting point for the sweep, not the answer. The outstanding question is whether the same diagnostic-driven recipe scales beyond the fusion stack; that’s what the next two sections take up.

11

Calculating the needed LoRA rank

Three ways to size the LoRA

All the rank prescriptions above came from an SVD of a frozen-TE fine-tune we already had. For a brand-new task you face a chicken-and-egg: to pick the LoRA rank you need an SVD of the FT’s ΔW, but doing the FT is exactly what LoRA is supposed to avoid. Three answers, in order of preference:

  • Partial fine-tune until the MP-rank converges. A short run (~25 % of one epoch on this task) gets you the per-tensor MP-rank you’d see at full convergence. SVD the in-progress ΔW, read off the rank, train the LoRA from the prescription. Primary method.
  • Saturation diagnostic on a small LoRA (VRAM-limited case). Train a low-rank LoRA you can afford to fit, then SVD the trained B·A product and apply the single-checkpoint MP test from §10. The number of singular values above the MP threshold is a lower bound on the rank you should provision. If it saturates the budget, retrain at higher rank; if not, you’re done.
  • SVD a generously sized LoRA. Train one high-rank LoRA (say r=64 or r=128 uniformly across all target tensors), SVD the per-tensor B·A products at intervals, and stop once the per-tensor MP-rank stabilises; then use those ranks as the prescription for a tight retrain. Comparable wall-clock to method (1) (rank also converges in the early steps here) but cheaper per step because the LoRA run never computes weight gradients or keeps optimiser state for the base model, so it scales gracefully when full FT memory is unavailable.
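
For the two LoRA-based routes, the singular values of B·A can be read off without materialising the full d_out × d_in product. The sketch below is one way to do it with NumPy, assuming B has shape (d_out, r), A has shape (r, d_in), and r is no larger than either dimension. `lora_singular_values` and `saturation_ratio` are hypothetical helper names, and the threshold is left to the caller (for example the MP threshold discussed in this post).

```python
import numpy as np

def lora_singular_values(B: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Singular values of B @ A, computed from two thin QRs and an r x r SVD."""
    _, Rb = np.linalg.qr(B)       # B = Qb @ Rb, Rb is (r, r)
    _, Ra = np.linalg.qr(A.T)     # A.T = Qa @ Ra, Ra is (r, r)
    # B @ A = Qb @ (Rb @ Ra.T) @ Qa.T, and Qb, Qa have orthonormal columns,
    # so the small core Rb @ Ra.T has the same non-zero singular values.
    return np.linalg.svd(Rb @ Ra.T, compute_uv=False)

def saturation_ratio(B: np.ndarray, A: np.ndarray, threshold: float) -> float:
    """Fraction of the rank budget whose singular values exceed `threshold`."""
    s = lora_singular_values(B, A)
    return int((s > threshold).sum()) / B.shape[1]

rng = np.random.default_rng(0)
B = rng.normal(size=(32, 4))      # toy shapes, not SAM3 tensors
A = rng.normal(size=(4, 48))
print(lora_singular_values(B, A))
print(saturation_ratio(B, A, threshold=1.0))
```

A ratio close to 1 corresponds to the "saturates the budget, retrain at higher rank" branch in the list above; a ratio well below 1 suggests the rank budget is not the limiting factor.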

The rest of this section covers method (1) in detail.

The SVD rank trajectory across training

We ran a fresh frozen-TE FT and computed, at grad-steps 50/100/200/400/800/1200/2000 plus end-of-epoch 1/2/3, the SVD of W(t) - W(0) for every one of the 259 LoRA-able matrices. Two rank measures per matrix:

  • MP-rank: count of singular values above the Marchenko-Pastur noise threshold, σ+ = σ_rms · (√m + √n) / √min(m, n) for an m×n matrix. This is the “task signal” rank: the count we use to size the LoRA.
  • e90: number of singular values capturing 90 % of the Frobenius energy. Counts signal + noise indiscriminately; shown to make the noise growth visible.
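
Table 9.17 · Per-step rank trajectory (median across 259 LoRA-able tensors). Same frozen-TE FT recipe that produced the 0.9215 IoU reference.

| grad-step | label | MP-rank (signal) | e90 (signal + noise) | train loss |
|---|---|---|---|---|
| 50 | | 6 | 11 | 85.6 |
| 100 | | 8 | 47 | 47.3 |
| 200 | | 11 | 74 | 27.6 |
| 400 | | 13 | 86 | 17.1 |
| 800 | | 15 | 98 | 11.9 |
| 1200 | | 16 | 102 | 9.96 |
| 2000 | | 16 | 107 | 8.21 |
| 5313 | end ep 1 | 16 | 118 | 5.98 |
| 10626 | end ep 2 | 16 | 126 | 3.77 |
| 15939 | end ep 3 | 16 | 120 | 3.95 |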

Three things to read off this table:

  • MP-rank stabilises by step ~1200 (less than 25 % of one epoch) and stays at median 16 through another three full epochs. The task knows which directions carry above-noise signal very early.
  • e90 keeps climbing throughout: 11 → 102 → 120. The full FT’s ΔW lives in a much wider subspace than MP-rank suggests, but the extra ~100 directions are optimisation noise — the Adam adaptive scaling keeps nudging every parameter slightly even when there’s no consistent gradient signal.
  • Train loss continues to drop at constant MP-rank. From step 1200 to epoch 3, loss falls 9.96 → 3.95 (~2.5× reduction) without the signal rank changing. The model is refining magnitudes within the same subspace, not adding directions.
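
The "stop once MP-rank has settled" rule can be made mechanical. The sketch below is one possible stopping criterion, not the procedure used for the runs above: `first_stable_step` and its `patience` parameter are hypothetical, and the data are the median MP-ranks from Table 9.17.

```python
def first_stable_step(steps, ranks, patience=2):
    """Earliest step whose rank is repeated at the next `patience` checkpoints, else None."""
    for i in range(len(steps) - patience):
        window = ranks[i : i + patience + 1]
        if all(r == window[0] for r in window):
            return steps[i]
    return None

# Median MP-rank per checkpoint, copied from Table 9.17
STEPS = [50, 100, 200, 400, 800, 1200, 2000, 5313, 10626, 15939]
MP_RANK = [6, 8, 11, 13, 15, 16, 16, 16, 16, 16]
STEPS_PER_EPOCH = 5313

stable = first_stable_step(STEPS, MP_RANK, patience=2)
print(stable, f"{stable / STEPS_PER_EPOCH:.1%} of one epoch")
```

In a live run the confirmation checkpoints are only available after the fact, so the partial fine-tune has to run `patience` checkpoints past the step it eventually reports.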

Why noise grows and signal doesn’t

The Marchenko-Pastur threshold formalises one specific question: how many singular values is this matrix expected to have above the noise floor of a same-shape iid Gaussian matrix with the same scale? Below σ+ is by construction indistinguishable from random fluctuations of gradient updates that don’t carry consistent task signal. Above σ+ is everything that beats noise.

σ+ = σ_rms · (√m + √n) / √min(m, n)

The interpretation of the table now writes itself. Optimisation noise accumulates: Adam keeps moving every parameter slightly on every step, even when the gradient on that direction is dominated by the optimiser’s own adaptive scaling rather than task structure. Over thousands of steps these tiny moves fill out a Gaussian-shaped cloud of directions, raising e90 without affecting MP-rank. Task signal saturates fast: the directions that consistently move the model toward lower loss are set once the gradient direction settles, which happens long before the loss does. After that, more training is gradient descent inside an already-chosen subspace.

Operationally: do not use e90 to size your LoRA. Do use MP-rank. The two diverge by 7× at convergence on this task.
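
The two rank measures can be sketched in NumPy on a synthetic ΔW. The threshold follows the formula above term by term, but one detail is an assumption: the post does not spell out how σ_rms is estimated, so the sketch takes it to be the RMS of the matrix’s singular values. The synthetic matrix (rank-3 signal plus unit Gaussian noise) and all function names are illustrative only.

```python
import numpy as np

def mp_threshold(s: np.ndarray, m: int, n: int) -> float:
    """sigma_+ = sigma_rms * (sqrt(m) + sqrt(n)) / sqrt(min(m, n))."""
    sigma_rms = np.sqrt(np.mean(s**2))   # ASSUMPTION: RMS of the singular values
    return sigma_rms * (np.sqrt(m) + np.sqrt(n)) / np.sqrt(min(m, n))

def mp_rank(delta_w: np.ndarray) -> int:
    """Count of singular values above the MP-style noise threshold."""
    m, n = delta_w.shape
    s = np.linalg.svd(delta_w, compute_uv=False)
    return int((s > mp_threshold(s, m, n)).sum())

def e90(delta_w: np.ndarray, energy: float = 0.90) -> int:
    """Smallest k whose top-k singular values hold `energy` of the Frobenius energy."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# Synthetic delta-W: three strong directions buried in unit Gaussian noise
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(64, 3)))
V, _ = np.linalg.qr(rng.normal(size=(64, 3)))
delta_w = 60.0 * U @ V.T + rng.normal(size=(64, 64))

print("MP-rank:", mp_rank(delta_w), " e90:", e90(delta_w))
```

On this toy matrix the MP count recovers the three planted directions while e90 also counts noise directions, which is the same qualitative gap the table shows. With this choice of σ_rms the threshold rises when strong signal is present, so the count is on the conservative side.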

An alternative path: AdaLoRA discovers its own rank

While MP-on-a-partial-FT works for SAM3, it’s still task loss filtered through SVD. A more direct “learn the rank” approach is AdaLoRA: reparameterise the LoRA update as P · diag(λ) · Q instead of B · A, train λ jointly with the task loss under an L1 penalty that pushes unused singular values to zero. The rank is what the model itself decides to use.

We tried this. Out of the box it underperformed by a wide margin — AdaLoRA at the full-model spec hit only 0.838 watch IoU at 3 epochs, ~8.5 points below the standard-LoRA equivalent. Several α-regime ablations didn’t move the result much (range 0.68–0.74 at 1 epoch). The fix eventually came from a small change to the initialisation:

Table 9.19 · AdaLoRA initialisation matters more than the regulariser tuning.

| Init | P | λ | Q | 1-ep IoU | 3-ep IoU |
|---|---|---|---|---|---|
| AdaLoRA paper default | orthogonal | 0 | orthogonal | 0.7389 | 0.8375 |
| LoRA-style | 0 | Kaiming-uniform | orthogonal | 0.8894 | 0.9083 |

The reason is in the gradient-flow timing. With λ = 0, only ∂L/∂λ is non-zero at step 1 (because the gradients on P and Q both flow through λ, which is zero). P and Q stay frozen for that step, and the model wastes warm-up time ramping up the small λ vector on its own. With P = 0, λ = Kaiming, Q = orthogonal, the analogue of standard LoRA’s (B=0, A=Kaiming) kicks in: ΔW is still zero at step 0 (because P=0), but the gradient on P is non-zero from step 1 (because λ and Q are). All three tensors are updating from step 2 onward. The 15-point IoU jump at 1 epoch (+0.150) is the warm-up advantage cashed in.
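
Which tensors receive gradient at the first step can be checked directly from the chain rule for ΔW = P · diag(λ) · Q. The sketch below is a toy with hand-written gradients: `G` is a random stand-in for ∂L/∂ΔW, the shapes are arbitrary, and a plain uniform draw stands in for the Kaiming-uniform λ init.

```python
import numpy as np

def adalora_grads(P, lam, Q, G):
    """Gradients w.r.t. P, lam, Q for dW = P @ diag(lam) @ Q, given G = dL/d(dW)."""
    grad_P = (G @ Q.T) * lam                      # (d_out, r): column k scaled by lam[k]
    grad_lam = np.einsum("ik,ij,kj->k", P, G, Q)  # (r,): diag(P.T @ G @ Q.T)
    grad_Q = lam[:, None] * (P.T @ G)             # (r, d_in): row k scaled by lam[k]
    return grad_P, grad_lam, grad_Q

rng = np.random.default_rng(0)
d_out, d_in, r = 12, 10, 4
G = rng.normal(size=(d_out, d_in))                        # stand-in upstream gradient
P_orth = np.linalg.qr(rng.normal(size=(d_out, r)))[0]     # orthonormal columns
Q_orth = np.linalg.qr(rng.normal(size=(d_in, r)))[0].T    # orthonormal rows
lam_init = rng.uniform(-0.5, 0.5, size=r)                 # stand-in for Kaiming-uniform

inits = {
    "paper default (lam = 0)": (P_orth, np.zeros(r), Q_orth),
    "LoRA-style (P = 0)": (np.zeros((d_out, r)), lam_init, Q_orth),
}
for name, (P, lam, Q) in inits.items():
    gP, gl, gQ = adalora_grads(P, lam, Q, G)
    moving = [n for n, g in (("P", gP), ("lam", gl), ("Q", gQ)) if np.any(g != 0)]
    print(f"{name}: non-zero gradient at step 1 on {moving}")
```

With λ = 0 only the r-entry vector λ moves on the first step; with P = 0 the first step moves the full d_out × r matrix P instead.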

Even with this fix, AdaLoRA only closes most of the gap to standard LoRA: 0.908 vs 0.922 at 3 epochs. The 3-tensor coupling (P, λ, Q) is intrinsically slower than the 2-tensor (B, A). The right framing: AdaLoRA is a useful diagnostic for which tensors to drop entirely, but standard LoRA from an MP-SVD prescription remains the best training recipe.

Are AdaLoRA’s “zero rank” flags safe to act on?

AdaLoRA found 5 tensors converging to effective rank 0 over 3 epochs — the L1 penalty drove all their λ values toward zero. Cross-check against the MP-rank prescription: the partial-FT SVD independently zeroed out 9 tensors of its own (mostly detection-side auxiliaries: boxRPB_embed, reference_points, bbox_embed, label_embed). Only 1 of the 5 AdaLoRA-zero tensors (geometry_encoder.points_direct_project) also has MP-rank = 0. The other 4 carry non-trivial MP-rank in the SVD diagnostic:

  • backbone.vision_backbone.trunk.blocks.1.attn.qkv — MP-rank 24
  • backbone.vision_backbone.trunk.blocks.2.attn.qkv — MP-rank 25
  • geometry_encoder.points_pool_project — MP-rank 21
  • geometry_encoder.points_pos_enc_project — MP-rank 5

The two diagnostics disagree on 4 of 5 tensors. AdaLoRA’s “effective rank 0” label and the SVD’s MP-rank = 0 are very different claims: MP-rank = 0 says the tensor carries no above-noise task signal, whereas AdaLoRA’s flag only shows that the L1 + orthogonality regulariser combination can suppress whatever signal a tensor carries, even one with 20+ usable singular directions.

12

Validating the recipe — full-model LoRA at 1.5 % of the parameter budget

Per-tensor heterogeneous rank — the diagnostic’s prediction

The saturation diagnostic showed the rank ceiling is heterogeneous across tensors: vision-trunk projection matrices stay signal-dense out to r ≈ 48, attention in_proj matrices want ~8–30, and FFN linear1 / linear2 tensors converge to just 1–4 signal directions. The natural prescription: give every tensor a rank matched to its own MP-rank measured from the SVD trajectory at step 1200 (where MP-rank has settled — see §27). 250 tensors get LoRA adapters; 9 with MP-rank=0 (mostly detection-side auxiliaries that don’t see watch-task gradient signal) are skipped entirely.

A compact view of the prescription by submodule kind — each tensor gets its own integer MP-rank, but tensors of the same kind cluster in similar ranges:

Table 9.12 · Heterogeneous per-tensor MP-rank prescription (full-model, 250 tensors)

| Module pattern | # tensors | rank | α (4× formula) | α/r |
|---|---|---|---|---|
| vision_trunk vision_proj | 32 | 40–54 (med 47) | 640–864 | ~16× |
| vision_trunk vision_qkv | 32 | 20–30 (med 24) | 396–484 | ~18× |
| vision_trunk FFN (fc1 / fc2) | 64 | 5–28 (med 15) | 200–468 | ~22× |
| fusion / decoder attention in_proj | 30 | 5–11 (med 8) | 200–292 | ~30× |
| fusion / decoder attention out_proj | 30 | 15–19 (med 16) | 332–384 | ~22× |
| fusion / decoder FFN (linear1 / linear2) | 30 | 1–5 (med 2) | 88–200 | ~60–88× |

Total: 12.5 M trainable parameters — 1.5 % of base SAM3. The α formula is α = 4 × max(4·r, ⌈22·√r⌉) per tensor — rsLoRA c = 88, four times the SAM3-calibrated baseline (c ≈ 22). The factor of four is read off §10’s α-ladder: the r=8 optimum sat near rsLoRA c ≈ 45, and the same boost factor applied per tensor generalises to other ranks via the rsLoRA c · √r scaling. FFN tensors with tiny rank (r=1–5) get the largest α/r ratios (60–88×) because the rsLoRA c·√r prescription scales gently with rank.
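
The per-tensor α formula is short enough to write down directly. `alpha_for_rank` is a hypothetical helper name; the formula itself is the one quoted above, with the outer factor exposed as a `boost` argument so the 2× and 4× variants share one function.

```python
import math

def alpha_for_rank(rank: int, boost: int = 4) -> int:
    """alpha = boost * max(4 * r, ceil(22 * sqrt(r))), as quoted in the text."""
    return boost * max(4 * rank, math.ceil(22 * math.sqrt(rank)))

# Ranks taken from the range endpoints in Table 9.12
for rank in (1, 5, 11, 20, 30, 40, 54):
    print(f"r={rank:>2}  alpha(4x)={alpha_for_rank(rank):>3}  alpha(2x)={alpha_for_rank(rank, boost=2):>3}")
```

The `max` switches branch around r ≈ 30: below that the rsLoRA-style ⌈22·√r⌉ term dominates, above it the linear 4·r term takes over.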

The end-to-end test — MP-rank prescription + 4× α

Now we have both pieces. The per-tensor MP-rank prescription (Table 9.12 above) says where the capacity should sit. The α-ladder in §10 says how hard to drive each adapter. Combine them and apply to the entire SAM3 graph — not just the five fusion layers, but the 32 vision-trunk blocks, the geometry encoder, the decoder, the mask predictor, the presence head. 250 tensors get LoRA adapters at their own MP-rank, with α = 4 × max(4·r, ⌈22·√r⌉) per tensor (rsLoRA c=88, four times the SAM3-calibrated baseline). 9 tensors with MP-rank = 0 are skipped. Total trainable parameter count: 12.5 M — about 1.5 % of the base model.

Table 9.13 · Watch IoU vs trainable parameters across recipes

| Recipe | trainable | % of base | watch IoU |
|---|---|---|---|
| base SAM3 (no fine-tune) | 0 | 0 % | 0.227 |
| LoRA F[0..4] r=48 (std α) | 1.84 M | 0.22 % | 0.869 |
| LoRA F[0..4] r=8, α=128 | 307 K | 0.037 % | 0.885 |
| fusion04_only (full-rank) | 7.9 M | 0.94 % | 0.870 |
| full-model LoRA, MP-rank + 2× α | 12.5 M | 1.5 % | 0.9115 |
| full-model LoRA, MP-rank + 4× α | 12.5 M | 1.5 % | 0.9232 |
| frozen-TE full FT | ~660 M | 79 % | 0.9215 |
| all-trainable full FT | ~840 M | 100 % | 0.916 |

The MP-rank LoRA at 4× α beats the frozen-TE full fine-tune: 0.9232 vs 0.9215, with 53× fewer trainable parameters. The per-tensor MP-rank diagnostic gives the right capacity allocation across tensor types (including FFN tensors at r=1–5), and the 4× α multiplier from §10’s ladder lets each tensor’s adapter actually train into that capacity. At 2× α the same rank spec lands at 0.9115; at 4× α it’s 0.9232. The +0.012 IoU between the two is pure α: the rank prescription is identical.

Two lessons

Two things did the heavy lifting. First, α matters a lot: it’s an effective learning-rate knob on the adapter, and the optimum sits far above the literature defaults (α = r, rsLoRA c ≈ 8). At r=8 the peak is near α=128 (boost 16×); at the heterogeneous spec the 4× α formula matched the frozen-TE FT. Second, the noise-cleaned MP-rank diagnostic held up empirically: per-tensor capacity from the partial-FT SVD plus a strong α is enough to recover full-FT IoU at 1.5 % of the parameters. Rank tells you where capacity goes; α tells you how hard to push each piece.

Takeaways

Next in the series · Post 02
~ 18 min read

How a single LayerNorm bias broke SAM3’s generality — and how we fixed it in 162 seconds of training time

The methodological point

Topline metrics under-determine which internal mechanism a model is using. Three recipes here landed within five points of each other on watch IoU. Inside, they look very different, and they produce three different out-of-domain failure rates from the same in-domain target. Mechanistic interpretability is what closes that gap: not a luxury for exotic models, but a default sanity check on any non-trivial fine-tune.

All figures in this post are computed from the actual checkpoint artifacts; every interactive plot above is the live data, not a render of a precomputed image.

Notes & references

1. All IoU numbers are reported on a held-out watch test set of 412 images, stratified by watch family.

2. Marchenko, V. A.; Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices.” Mathematics of the USSR-Sbornik.

3. Hu et al. (2021), “LoRA: Low-Rank Adaptation of Large Language Models”, arXiv:2106.09685.

4. Kalajdzievski (2023), “A rank stabilization scaling factor for fine-tuning with LoRA”, arXiv:2312.03732.

5. Gavish & Donoho (2014), “The Optimal Hard Threshold for Singular Values is 4/√3”, arXiv:1305.5870.

6. Roy & Vetterli (2007), “The effective rank: A measure of effective dimensionality”.

© 2025 aj research