Rough research notes, paper summaries, and working ideas.
These are rough notes on trustworthy LLMs, RL alignment, biomedical AI, and other things I'm thinking through. Writing them quickly means I share more, but quality varies, so take them with a grain of salt. If something looks wrong, please let me know.
TL;DR. Re-read this after a year of preference optimization work. The "skip the reward model, just classify" framing still feels too clean to be real. It is.
read 2026-05-12 | dpo, alignment, rlhf
My notes
Why I picked this up
MAV uses DPO as the inner alignment loss, so I wanted to revisit the original framing now that I've actually shipped a DPO-based pipeline.
What stood out
The closed-form policy extraction is the kind of result that makes you wonder why nobody noticed it earlier.
The real win isn't the metric numbers; it's the stability: no PPO collapse, no separate reward-model drift.
"Your LM is secretly a reward model" reframed how I think about implicit rewards hiding inside supervised data.
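To make the "just classify" framing concrete, here is a minimal sketch of the DPO loss for a single preference pair, assuming I already have summed sequence log-probabilities from the policy and from the frozen reference model. The implicit reward is beta times the log-ratio between policy and reference, and the loss is a logistic classification loss on the reward margin. The function names and the default `beta=0.1` are my own illustrative choices, not values from a specific codebase.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response.
    beta is an assumed hyperparameter value, not a recommendation.
    """
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x))
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


# When the policy equals the reference, the margin is zero and the
# loss is log(2); raising the chosen log-prob relative to the
# reference pushes the loss below that.
loss_at_init = dpo_loss(-2.0, -2.0, -2.0, -2.0)
loss_after_update = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No reward model appears anywhere: the only learned quantity is the policy's own log-probabilities, which is the point of the paper's title.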
Connection to my work
MAV filters corrupted preferences before DPO. The intuition that verification can replace fine-tuning partly comes from DPO showing how much complexity the explicit reward model was hiding.
What I'm trying next
Compare MAV-filtered DPO against vanilla DPO at a fixed corruption budget. I want to find where MAV's gain saturates.
TL;DR. A model can self-critique using a written constitution. The direct ancestor of what I'm doing with MAV, just with one verifier instead of many.
read 2026-05-05 | alignment, safety, rlaif
My notes
Why I picked this up
MAV uses multi-agent verification. CAI was the first major paper to show AI feedback could replace human harmlessness labels for safety alignment, so it's foundational reading.
What stood out
The two-stage SL + RL design is more elegant than I remembered.
Chain-of-thought during the critique step improves both the judged quality and the interpretability of the decision.
Non-evasive harmlessness (engaging with harmful queries by explaining the objection rather than refusing flatly) is an underrated framing.
Connection to my work
MAV is essentially CAI extended to multiple verifier agents with disagreement-aware aggregation. Re-reading CAI exposed which design choices I'd quietly inherited.
What I'm trying next
Test whether constitutional rules can be made domain-specific. A drug discovery constitution for BioGen's reasoning verifier would be a clean experiment.
TL;DR. Interleave a thought with an action, then repeat. Almost every "agentic LLM" framework today is some version of this loop.
read 2026-04-28 | agents, reasoning, tools
My notes
Why I picked this up
BioGen is an agentic LLM framework for RNA-Seq reasoning. I needed to understand the agent-reasoning literature from the source, and ReAct is the source.
What stood out
The thought-action-observation trace is human-readable and auditable. That alone makes it a win over opaque chain-of-thought.
Hallucinations drop sharply when reasoning is grounded in real tool calls.
Big jumps with only 1-2 in-context examples and no fine-tuning. Hard to overstate how surprising this was at the time.
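The thought-action-observation loop is short enough to sketch directly. This is a toy version, assuming a `policy` callable that returns a (thought, action, argument) triple; in a real system that would be an LLM call with few-shot examples, but here it is a scripted stub so the code runs on its own. The tool registry, the `finish` action name, and the toy lookup table are all hypothetical.

```python
def react_loop(policy, tools, question, max_steps=5):
    """Interleave thoughts and tool calls until the policy finishes.

    Returns (answer, trace). The trace is a list of dicts, one per
    step, which is what makes the reasoning auditable.
    """
    trace = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        if action == "finish":
            trace.append({"thought": thought, "action": action,
                          "input": arg, "observation": None})
            return arg, trace
        # Ground the next thought in a real tool result.
        observation = tools[action](arg)
        trace.append({"thought": thought, "action": action,
                      "input": arg, "observation": observation})
    return None, trace  # step budget exhausted without an answer


# Hypothetical stand-ins for an LLM policy and a search tool.
TOY_KB = {"capital of France": "Paris"}
TOOLS = {"lookup": lambda query: TOY_KB.get(query, "no result")}


def scripted_policy(question, trace):
    if not trace:
        return ("I should look this up.", "lookup", question)
    return ("The lookup answered it.", "finish", trace[-1]["observation"])


answer, trace = react_loop(scripted_policy, TOOLS, "capital of France")
```

Every step leaves a record of what was thought, what was called, and what came back, so a verifier can be attached to any row of the trace.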
Connection to my work
BioGen extends ReAct with verifier agents that check biological plausibility at each step. ReAct's "act on the world to ground reasoning" maps directly to my "verify biological evidence before continuing."
What I'm trying next
Measure how often BioGen's verifier disagrees with the ReAct-style reasoning trace. Are disagreements informative or just noise?
TL;DR. The paper that productionized RLHF. 1.3B aligned beats 175B base on user preference. The clearest argument I've seen that data quality beats parameters.
read 2026-04-20 | rlhf, alignment, sft
My notes
Why I picked this up
To understand the full PPO-based RLHF pipeline before working on top of the cleaner DPO version. Knowing the original cost makes me appreciate DPO's simplification more.
What stood out
1.3B aligned vs 175B base is a stunning result for data quality > parameters.
The alignment tax exists but is much smaller than feared. Reassuring for production.
Truthfulness and toxicity gains came at minimal capability cost, counter to the "scale only" narrative of the time.
Connection to my work
This is the recipe MAV operates on top of. Replace the human labelers in the reward-modeling step with a verifier agent, and you get something MAV-shaped.
What I'm trying next
Sketch a comparison: PPO-RLHF vs DPO vs OAIF vs MAV. Where does each fit on the supervision-quality × compute frontier?
TL;DR. Modern deep nets are wildly overconfident. One scalar (temperature scaling) fixes most of it. The starting point of basically every uncertainty paper since.
read 2026-04-10 | calibration, uncertainty
My notes
Why I picked this up
UAT-LITE is about uncertainty in pretrained transformers. Guo et al. is the foundational reference; almost every calibration paper since builds on it.
What stood out
One scalar parameter T (temperature scaling) is enough to mostly fix the overconfidence. The simplicity is the message.
Calibration is a "free" deployment win: no retraining, just a held-out set.
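A minimal sketch of temperature scaling, assuming I have held-out logits and labels as plain Python lists. The model is untouched; the only fitted quantity is the scalar T that divides the logits before the softmax. I use a grid search over T here to keep the example dependency-free, which is my simplification: in practice T is usually fitted with a gradient-based optimizer. The grid range and the toy data are assumptions.

```python
import math


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    total = 0.0
    for row, label in zip(logits, labels):
        scaled = [z / temperature for z in row]
        top = max(scaled)
        # Stable log-sum-exp for the softmax normalizer.
        log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def fit_temperature(logits, labels, grid=None):
    """Pick the T that minimizes held-out NLL (grid search, an assumption)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 1001)]  # 0.01 .. 10.00
    return min(grid, key=lambda t: nll(logits, labels, t))


# Toy held-out set: the model is ~98% confident on every example
# but only right 3 times out of 4, i.e. overconfident.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
```

On this toy set the fitted T comes out above 1, which softens the softmax until the confidence matches the 75% accuracy. Accuracy itself is unchanged, because dividing by a positive scalar never changes the argmax.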
Connection to my work
UAT-LITE inherits this lineage. Where Guo et al. tunes T post-hoc, I modulate attention with epistemic uncertainty at inference time. Same spirit, deeper intervention.
What I'm trying next
Run temperature scaling on top of UAT-LITE outputs and check if calibration improves further. A clean ablation.
TL;DR. Models can self-supervise tool use. The "insert API calls during pretraining, keep the ones that lower loss" trick is one of those moves I now see reused everywhere.
read 2026-03-28 | agents, tools, sft
My notes
Why I picked this up
For BioGen's tool-calling design. BioGen needs to call BLAST, GO ontology lookups, and pathway databases. Toolformer is the cleanest paper on how to teach a model when to call a tool.
What stood out
Pure self-supervision, no human tool-use labels. Scales nicely.
The candidate API + loss-decrease filter is a clever trick I now see reused in many follow-ups.
Core LM perplexity preserved. Matters for production deployment.
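The loss-decrease filter can be sketched in a few lines, assuming the three per-candidate losses have already been computed by the LM: the loss on the continuation with no API call, with the call but no result, and with the call plus its result. A candidate is kept only when seeing the result helps by at least a threshold relative to the better of the two baselines. The threshold value, field names, and the example numbers are illustrative assumptions.

```python
def keep_api_call(loss_no_call, loss_call_no_result,
                  loss_call_with_result, tau=1.0):
    """Keep a candidate call only if its result lowers the LM loss enough.

    tau is an assumed threshold; the right value is task-dependent.
    """
    baseline = min(loss_no_call, loss_call_no_result)
    return baseline - loss_call_with_result >= tau


def filter_candidates(candidates, tau=1.0):
    """Filter a list of candidate dicts with precomputed losses."""
    return [
        c for c in candidates
        if keep_api_call(c["loss_no_call"], c["loss_call_no_result"],
                         c["loss_call_with_result"], tau)
    ]


# Hypothetical candidates with made-up loss values.
candidates = [
    {"call": "Calculator(400 / 1400)", "loss_no_call": 2.0,
     "loss_call_no_result": 2.2, "loss_call_with_result": 0.8},
    {"call": "Calendar()", "loss_no_call": 2.0,
     "loss_call_no_result": 1.5, "loss_call_with_result": 0.8},
]
kept = filter_candidates(candidates)
```

Taking the minimum over both baselines matters: the second candidate is dropped because merely inserting the call text already explains most of the loss drop, so the result itself adds too little.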
Connection to my work
BioGen's tool selection follows Toolformer's pattern but adds verifier feedback as an additional filter. The verifier catches "tool calls that look right but return biologically implausible answers."
What I'm trying next
Compare BioGen's verifier-filtered tool calls against pure loss-filtered Toolformer-style calls on a biomedical QA benchmark.
TL;DR. Online, on-policy DPO with an LLM annotator. Beats offline DPO and PPO-RLHF. They assume the annotator is reliable; MAV is what happens when you can't.
read 2026-03-15 | alignment, dpo, online-learning
My notes
Why I picked this up
Directly relevant to my "verified online supervision" thesis. They use an LLM as annotator without verification. MAV is what happens when the annotator can be wrong.
What stood out
Annotator-prompt-as-objective-knob is elegant. Want a different alignment objective? Change the prompt.
On-policy sampling closes the train-deploy gap that offline DPO has.
The ablations are clean. Online vs offline is isolated as a single variable.
Connection to my work
MAV is essentially OAIF + verification. They assume the LLM annotator is reliable. I show how to filter when it isn't.
What I'm trying next
Replicate OAIF's setup with a corrupted annotator (e.g., random label flip at p=0.2) and check if MAV closes the gap.
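A minimal sketch of the corrupted-annotator setup, assuming preferences are stored as (chosen, rejected) tuples. Each pair is flipped independently with probability p, and the seed is fixed so that every method under comparison sees the same corruption. The function name and return shape are hypothetical.

```python
import random


def corrupt_preferences(pairs, p=0.2, seed=0):
    """Flip each (chosen, rejected) pair independently with probability p.

    Returns the corrupted pairs and the number of flips, so the
    realized corruption rate can be reported alongside the nominal p.
    """
    rng = random.Random(seed)  # local RNG keeps the corruption reproducible
    corrupted = []
    n_flipped = 0
    for chosen, rejected in pairs:
        if rng.random() < p:
            corrupted.append((rejected, chosen))
            n_flipped += 1
        else:
            corrupted.append((chosen, rejected))
    return corrupted, n_flipped


# Toy preference data; real pairs would be model responses.
clean = [(f"good_{i}", f"bad_{i}") for i in range(1000)]
noisy, n_flipped = corrupt_preferences(clean, p=0.2, seed=0)
realized_rate = n_flipped / len(clean)
```

Independent flips give a realized rate that only approximates p on small datasets. For a strictly fixed corruption budget, sampling exactly round(p * n) indices to flip would be the alternative design.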
Why I picked this up
Bridges to MedBayes-Lite. Clinical models need to say "I don't know" rather than be confidently wrong. Geifman gave the formal framework I've been leaning on.
What stood out
A reject option for deep nets: simple but powerful. Surprising it didn't show up earlier.
A guaranteed 2% top-5 ImageNet error at roughly 60% coverage, holding with 99.9% probability, is wild for 2017.
Works on any pre-trained net via softmax response or MC-dropout. Zero retraining.
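The reject option with softmax response reduces to sorting by confidence and measuring error on the examples the model chooses to answer. Below is a minimal sketch of an empirical risk-coverage curve, assuming per-example confidences (for instance the max softmax probability) and correctness flags are already available. It does not implement the statistical guarantee part of the method; it only traces the empirical trade-off. Names and toy values are my own.

```python
def risk_coverage_curve(confidences, correct):
    """Empirical (coverage, selective risk) points, most confident first.

    coverage = fraction of examples answered
    risk     = error rate among the answered examples
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        curve.append((k / len(order), errors / k))
    return curve


def max_coverage_at_risk(curve, target_risk):
    """Largest coverage whose empirical risk stays within the target."""
    feasible = [cov for cov, risk in curve if risk <= target_risk]
    return max(feasible) if feasible else 0.0


# Toy data: the one mistake sits at fairly low confidence.
confidences = [0.9, 0.8, 0.6, 0.55]
correct = [True, True, False, True]
curve = risk_coverage_curve(confidences, correct)
coverage_at_zero_risk = max_coverage_at_risk(curve, 0.0)
```

A useful confidence score is one where this curve rises slowly, meaning errors concentrate in the low-confidence tail that gets rejected first.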
Connection to my work
MedBayes-Lite uses calibrated uncertainty for selective prediction. Geifman gave the framework. I extend it with Bayesian uncertainty in transformer attention.
What I'm trying next
Compare MedBayes-Lite's risk-coverage curve against vanilla softmax + Geifman's method on a clinical decision support benchmark.
TL;DR. Decoder-only Transformers can collapse distinct inputs to identical hidden states, especially in low precision. Connects LLM failure modes to GNN over-squashing in a satisfying way.
read 2026-02-12 | transformers, theory, gnn
My notes
Why I picked this up
One of the few recent theory papers that explains LLM failure modes in a way I find satisfying. The GNN over-squashing connection was unexpected and worth chasing.
What stood out
Connects LLM failures (counting, copying) to GNN over-squashing. A beautiful theoretical bridge.
Low-precision floats exacerbate collapse. A real production concern with bf16 / fp8 inference.
Their proposed fixes are surprisingly simple, mostly higher-precision final layers.
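A toy illustration of how low precision turns "nearly identical" into "identical". This is not the paper's construction, just a sketch under my own assumptions: uniform averaging stands in for attention, tokens are scalars, and fp16 (via the `struct` module's half-precision format) stands in for bf16 or fp8, which the standard library cannot represent. It compares a sequence of all ones against the same sequence ending in a single zero.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mean_pooled_fp16(tokens):
    """Uniform-attention stand-in: average the tokens, store in fp16.

    The mean is computed in float64 and rounded once, which is a
    simplification of real low-precision accumulation.
    """
    return to_fp16(sum(tokens) / len(tokens))


def collapses(n):
    """Do n ones + one zero and n + 1 ones get the same fp16 summary?"""
    all_ones = [1.0] * (n + 1)
    ones_then_zero = [1.0] * n + [0.0]
    return mean_pooled_fp16(all_ones) == mean_pooled_fp16(ones_then_zero)


short_sequence_collapses = collapses(10)
long_sequence_collapses = collapses(5000)
```

For the short sequence the two summaries differ, while for the long one the single zero shifts the mean by less than half the fp16 spacing just below 1.0, so both inputs round to the same value. The same counting-style information loss gets worse as the format gets coarser.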
Connection to my work
UAT-LITE is about salvaging information in attention at inference time. Barbero's collapse result suggests where in the network the information loss happens. Useful for designing better attention interventions.
What I'm trying next
Check if UAT-LITE attention modulation reduces representational collapse on counting tasks. Could be a nice qualitative experiment.