Research notes on machine learning, interpretability and the things we find inside the models.

Long-form write-ups from open-ended experiments — mechanistic interpretability, fine-tuning recipes, unexpected failure modes. Each post tries to leave a small, transferable lesson behind.

Latest

Mech-interp · SAM3 series · Post 02

How a 162-second retrain recovers SAM3's open-vocab refusal

Post 01 left SAM3 with a catastrophic-forgetting failure: open-vocab refusal collapsed from 95.8% to 3.2% on SA-Co/Gold. This post tests three data-side recipes (replay, replay with negatives, post-hoc recovery), explains mechanistically why each one stops where it does, and shows a 132k-parameter retrain of a single MLP that recovers more refusal than any of them in 162 seconds of training. Post 2 of the SAM3 series.

Jun 3, 2026 · 24 min read · Andreas Jörg

Mech-interp · SAM3 series · Post 01

How a vision transformer learns a new task — what we found inside SAM3

A mechanistic-interpretability tour of fine-tuning SAM3 on 37 watch-component concepts. The weights move in a low-rank subspace, the task crystallises at a single 256-dim mid-stack tensor, the text encoder turns out to be optional — and the same checkpoint fails catastrophically out of domain. Post 1 of the SAM3 series.

May 19, 2026 · 22 min read · Andreas Jörg