🧠 Notes

Rough research notes, paper summaries, and working ideas.

These are rough notes on trustworthy LLMs, RL alignment, biomedical AI, and other things I'm thinking through. Writing them quickly means I share more, but quality varies, so take them with a grain of salt. If something looks wrong, please let me know.

⭐ Featured Notes

my four must-reads

⚖️

Direct Preference Optimization

The cleanest result I've seen in alignment. Skip the reward model, just classify. MAV builds directly on this.

🔗

Online AI Feedback (OAIF)

On-policy DPO with an LLM annotator. They assume reliability; MAV is what happens when you can't.

🛡️

Selective Classification for DNNs

Pick a risk budget, the model abstains on uncertain inputs. The formal framework behind MedBayes-Lite.

🤖

ReAct: Reasoning + Acting

Interleave thought and tool call, repeat. The agentic-LLM loop that BioGen extends with verifier agents.

🗒️ Reading Log

my annotations, not summaries

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

TL;DR. Re-read this after a year of preference optimization work. The "skip the reward model, just classify" framing still feels too clean to be real, yet it holds up.

read 2026-05-12 dpo alignment rlhf

My notes

Why I picked this up

MAV uses DPO as the inner alignment loss, so I wanted to revisit the original framing now that I've actually shipped a DPO-based pipeline.

What stood out

The closed-form policy extraction is the kind of result that makes you wonder why nobody noticed it earlier.
The real win isn't the metric numbers, it's the stability: no PPO collapse, no separate reward-model drift.
"Your LM is secretly a reward model" reframed how I think about implicit rewards hiding inside supervised data.

Connection to my work

MAV filters corrupted preferences before DPO. The intuition that verification can replace fine-tuning partly comes from DPO showing how much complexity the explicit reward model was hiding.
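To make the "just classify" framing concrete, here is a minimal sketch of the per-pair DPO loss. The implicit reward for a response is beta times the log-ratio between the policy and the reference model, and the loss is a logistic loss on the reward margin between the chosen and rejected responses. The function and argument names and the default beta=0.1 are my own choices for illustration, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair (illustrative names, not a library API)."""
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)) for each response.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin): binary classification on the margin.
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2. Raising the chosen response's log-probability relative to the reference pushes the loss below that.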

What I'm trying next

Compare MAV-filtered DPO against vanilla DPO at a fixed corruption budget. I want to find where MAV's gain saturates.

Constitutional AI: Harmlessness from AI Feedback

TL;DR. A model can self-critique using a written constitution. The direct ancestor of what I'm doing with MAV, just with one verifier instead of many.

read 2026-05-05 alignment safety rlaif

My notes

Why I picked this up

MAV uses multi-agent verification. CAI was the first major paper to show that AI feedback could replace human harmlessness labels in safety alignment, so it's foundational reading.

What stood out

The two-stage SL + RL design is more elegant than I remembered.
Chain-of-thought during the critique step improves both the judged quality and the interpretability of the decision.
Non-evasive harmlessness (engage with harmful queries by explaining the objection rather than refusing flatly) is an underrated framing.

Connection to my work

MAV is essentially CAI extended to multiple verifier agents with disagreement-aware aggregation. Re-reading CAI exposed which design choices I'd quietly inherited.

What I'm trying next

Test whether constitutional rules can be made domain-specific. A drug discovery constitution for BioGen's reasoning verifier would be a clean experiment.

ReAct: Synergizing Reasoning and Acting in Language Models

TL;DR. Interleave a thought with an action, then repeat. Almost every "agentic LLM" framework today is some version of this loop.

read 2026-04-28 agents reasoning tools

My notes

Why I picked this up

BioGen is an agentic LLM framework for RNA-Seq reasoning. I needed to understand the agent-reasoning literature from the source, and ReAct is the source.

What stood out

The thought-action-observation trace is human-readable and auditable. That alone makes it a win over opaque chain-of-thought.
Hallucinations drop sharply when reasoning is grounded in real tool calls.
Big jumps with only 1-2 in-context examples and no fine-tuning. Hard to overstate how surprising this was at the time.
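The thought-action-observation loop can be sketched in a few lines. This is a toy version under stated assumptions: `react_loop`, `scripted_policy`, the `lookup` tool, and the `FACTS` table are hypothetical stand-ins so the sketch runs without a model, and a real system would replace `scripted_policy` with an LLM call that emits the next thought and action.

```python
from typing import List, Optional, Tuple

# One step of the trace: (thought, action, argument, observation).
Step = Tuple[str, str, str, Optional[str]]


def react_loop(policy, tools, question, max_steps=5):
    """Alternate policy decisions and tool calls until the policy finishes."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(question, trace)
        if action == "finish":
            trace.append((thought, action, argument, None))
            return argument, trace
        # Ground the next thought in a real tool result.
        observation = tools[action](argument)
        trace.append((thought, action, argument, observation))
    return None, trace  # step budget exhausted without an answer


# Hypothetical stand-ins for a model and a tool.
FACTS = {"capital of France": "Paris"}
tools = {"lookup": lambda key: FACTS.get(key, "not found")}


def scripted_policy(question, trace):
    if not trace:
        return ("I should look this up.", "lookup", "capital of France")
    return ("The lookup answered the question.", "finish", trace[-1][3])


answer, trace = react_loop(scripted_policy, tools, "What is the capital of France?")
```

The returned `trace` is the auditable part: every step records what the model thought, which tool it called, and what came back.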

Connection to my work

BioGen extends ReAct with verifier agents that check biological plausibility at each step. ReAct's "act on the world to ground reasoning" maps directly to my "verify biological evidence before continuing."

What I'm trying next

Measure how often BioGen's verifier disagrees with the ReAct-style reasoning trace. Are disagreements informative or just noise?

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

TL;DR. The paper that productionized RLHF. 1.3B aligned beats 175B base on user preference. The clearest argument I've seen that data quality beats parameters.

read 2026-04-20 rlhf alignment sft

My notes

Why I picked this up

To understand the full PPO-based RLHF pipeline before working on top of the cleaner DPO version. Knowing the original cost makes me appreciate DPO's simplification more.

What stood out

Labelers preferring the 1.3B aligned model over the 175B base model is a stunning result for data quality over parameters.
The alignment tax exists but is much smaller than feared. Reassuring for production.
Truthfulness and toxicity gains came at minimal capability cost, counter to the "scale only" narrative of the time.

Connection to my work

This is the recipe MAV operates on top of. Replace the human labelers in the reward-modeling step with a verifier agent, and you get something MAV-shaped.

What I'm trying next

Sketch a comparison: PPO-RLHF vs DPO vs OAIF vs MAV. Where does each fit on the supervision-quality × compute frontier?

On Calibration of Modern Neural Networks

TL;DR. Modern deep nets are wildly overconfident. One scalar (temperature scaling) fixes most of it. The starting point of basically every uncertainty paper since.

read 2026-04-10 calibration uncertainty

My notes

Why I picked this up

UAT-LITE is about uncertainty in pretrained transformers. Guo et al. is the foundational reference; almost every calibration paper since builds on it.

What stood out

Depth, width, BatchNorm, less weight decay. Every "deep learning wins" choice hurts calibration. Striking inverse correlation.
One scalar parameter T (temperature scaling) is enough to mostly fix it. The simplicity is the message.
Calibration is a "free" deployment win: no retraining, just a held-out validation set.
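A minimal sketch of temperature scaling on a held-out set, in pure Python. The logits are divided by a single scalar T, chosen to minimize negative log-likelihood on validation data. Two things here are my own simplifications: a grid search replaces a proper optimizer, and the toy validation set (`val_logits`, `val_labels`) is made up for illustration.

```python
import math


def mean_nll(logits, labels, temperature):
    """Average negative log-likelihood of softmax(logits / temperature)."""
    total = 0.0
    for row, label in zip(logits, labels):
        scaled = [z / temperature for z in row]
        peak = max(scaled)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def fit_temperature(logits, labels, grid=None):
    """Pick T from a grid by minimizing held-out NLL (grid search is a simplification)."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5, 0.55, ..., 5.0
    return min(grid, key=lambda t: mean_nll(logits, labels, t))


# Toy held-out set: the model always prefers class 0 by a logit gap of 5,
# but is right only 3 times out of 4, so it is overconfident.
val_logits = [[5.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
```

On this toy set the fitted T lands near 5 / ln 3, the value at which the softmax confidence matches the 75% accuracy. Dividing by a positive T never changes the argmax, so accuracy is untouched.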

Connection to my work

UAT-LITE inherits this lineage. Where Guo et al. tunes T post-hoc, I modulate attention with epistemic uncertainty at inference time. Same spirit, deeper intervention.

What I'm trying next

Run temperature scaling on top of UAT-LITE outputs and check if calibration improves further. A clean ablation.
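For the ablation, calibration has to be measured somehow. A common choice is expected calibration error (ECE), the metric Guo et al. report. Below is a minimal sketch assuming equal-width confidence bins; the function name and the left-closed bin edges are my own choices, and the UAT-LITE outputs would be plugged in as `confidences` and `correct`.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

Computing this before and after temperature scaling on the same held-out split gives a single number to compare per configuration.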

Toolformer: Language Models Can Teach Themselves to Use Tools

TL;DR. Models can self-supervise tool use. The "insert candidate API calls into text, keep the ones that lower loss, then fine-tune on the result" trick is one of those moves I now see reused everywhere.

read 2026-03-28 agents tools sft

My notes

Why I picked this up

For BioGen's tool-calling design. BioGen needs to call BLAST, Gene Ontology (GO) lookups, and pathway databases. Toolformer is the cleanest paper on how to teach a model when to call a tool.

What stood out

Pure self-supervision, no human tool-use labels. Scales nicely.
The candidate API + loss-decrease filter is a clever trick I now see reused in many follow-ups.
Core LM perplexity preserved. Matters for production deployment.
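The loss-decrease filter is simple enough to sketch directly. A candidate API call is kept only if conditioning on the call and its result lowers the LM loss on the following tokens by at least a threshold, compared with the better of two baselines: no call at all, or the call without its result. The function name, argument names, and the default threshold are assumptions for illustration, and the losses are assumed to be computed elsewhere.

```python
def keep_api_call(loss_with_result, loss_without_call, loss_call_no_result, tau=1.0):
    """Return True if the API call plus its result helps prediction by at least tau."""
    # Baseline: the best the model can do without seeing the tool's result.
    baseline = min(loss_without_call, loss_call_no_result)
    return baseline - loss_with_result >= tau


# Example: the result cuts the loss from 2.5 to 1.0, so the call is kept.
kept = keep_api_call(loss_with_result=1.0, loss_without_call=3.0, loss_call_no_result=2.5)
# Example: the result only cuts the loss from 2.5 to 2.0, so the call is dropped.
dropped = keep_api_call(loss_with_result=2.0, loss_without_call=3.0, loss_call_no_result=2.5)
```

A verifier-based filter like BioGen's would sit next to this check as a second condition rather than replace it.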

Connection to my work

BioGen's tool selection follows Toolformer's pattern but adds verifier feedback as an additional filter. The verifier catches "tool calls that look right but return biologically implausible answers."

What I'm trying next

Compare BioGen's verifier-filtered tool calls against pure loss-filtered Toolformer-style calls on a biomedical QA benchmark.

Direct Language Model Alignment from Online AI Feedback (OAIF)

TL;DR. On-policy DPO with an LLM annotator. Beats offline DPO and PPO-RLHF. They assume the annotator is reliable; MAV is what happens when you can't.

read 2026-03-15 alignment dpo online-learning

My notes

Why I picked this up

Directly relevant to my "verified online supervision" thesis. They use an LLM as annotator without verification. MAV is what happens when the annotator can be wrong.

What stood out

Annotator-prompt-as-objective-knob is elegant. Want a different alignment objective? Change the prompt.
On-policy sampling closes the train-deploy gap that offline DPO has.
The ablations are clean. Online vs offline is isolated as a single variable.

Connection to my work

MAV is essentially OAIF + verification. They assume the LLM annotator is reliable. I show how to filter when it isn't.

What I'm trying next

Replicate OAIF's setup with a corrupted annotator (e.g., random label flip at p=0.2) and check if MAV closes the gap.
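A minimal sketch of the corruption step, assuming preferences are stored as (chosen, rejected) pairs. Each pair is flipped independently with probability p, using a seeded generator so runs are repeatable. The function name and the pair representation are my own assumptions, not OAIF's or MAV's code.

```python
import random


def corrupt_preferences(pairs, p=0.2, seed=0):
    """Swap chosen and rejected in each pair independently with probability p."""
    rng = random.Random(seed)  # seeded so the corruption pattern is reproducible
    corrupted = []
    n_flipped = 0
    for chosen, rejected in pairs:
        if rng.random() < p:
            corrupted.append((rejected, chosen))
            n_flipped += 1
        else:
            corrupted.append((chosen, rejected))
    return corrupted, n_flipped


clean = [(f"good_{i}", f"bad_{i}") for i in range(10000)]
noisy, n_flipped = corrupt_preferences(clean, p=0.2, seed=0)
```

Returning the flip count makes it easy to confirm that the realized corruption rate is close to the nominal p before training on the noisy pairs.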

Selective Classification for Deep Neural Networks

TL;DR. Pick a risk budget, the model abstains on uncertain inputs. The cleanest framework for safe medical AI I've found.

read 2026-02-25 uncertainty safety selective-prediction

My notes

Why I picked this up

Bridges to MedBayes-Lite. Clinical models need to say "I don't know" rather than be confidently wrong. Geifman gave the formal framework I've been leaning on.

What stood out

Reject-option for deep nets, simple but powerful. Surprising it didn't show up earlier.
A guaranteed 2% top-5 ImageNet error at almost 60% coverage, holding with probability 99.9%, is wild for 2017.
Works on any pre-trained net via softmax response or MC-dropout. Zero retraining.
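The reject option itself is a threshold on a confidence score such as the softmax response. The sketch below computes coverage and selective risk for a given threshold. It deliberately omits the part of the paper that chooses the threshold with a high-probability risk guarantee, and the function name and toy numbers are assumptions for illustration.

```python
def selective_metrics(confidences, correct, threshold):
    """Coverage and selective risk when abstaining below a confidence threshold."""
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    # Selective risk: error rate on the accepted inputs only (0.0 if none accepted).
    risk = 1.0 - sum(accepted) / len(accepted) if accepted else 0.0
    return coverage, risk


# Toy data: the two least confident predictions are the wrong ones.
confidences = [0.99, 0.95, 0.80, 0.60, 0.55]
correct = [1, 1, 1, 0, 0]
strict = selective_metrics(confidences, correct, threshold=0.75)
lenient = selective_metrics(confidences, correct, threshold=0.50)
```

Sweeping the threshold and plotting risk against coverage gives the risk-coverage curve used to compare selective classifiers.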

Connection to my work

MedBayes-Lite uses calibrated uncertainty for selective prediction. Geifman gave the framework. I extend it with Bayesian uncertainty in transformer attention.

What I'm trying next

Compare MedBayes-Lite's risk-coverage curve against vanilla softmax + Geifman's method on a clinical decision support benchmark.

Transformers Need Glasses! Information Over-squashing in Language Tasks

TL;DR. Decoder-only Transformers can collapse distinct inputs to identical hidden states, especially in low precision. Connects LLM failure modes to GNN over-squashing in a satisfying way.

read 2026-02-12 transformers theory gnn

My notes

Why I picked this up

One of the few recent theory papers that explains LLM failure modes in a way I find satisfying. The GNN over-squashing connection was unexpected and worth chasing.

What stood out

Connects LLM failures (counting, copying) to GNN over-squashing. A beautiful theoretical bridge.
Low-precision floats exacerbate collapse. A real production concern with bf16 / fp8 inference.
Their proposed fix is surprisingly simple: insert additional tokens to break up long repeated sequences so the representations stay apart.
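A toy illustration of why low precision matters here. This is not the paper's construction: a uniform average stands in for attention, and the input is n ones followed by a single zero. As n grows, the averages for lengths n and n+1 get closer, and in half precision they eventually round to the same number even though they stay distinct in float64. All names below are my own.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float (float64) to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mean_of_ones_then_zero(n: int) -> float:
    # Uniform average over n ones followed by a single zero: n / (n + 1).
    return n / (n + 1)


def first_collision(max_n: int = 10000):
    """Smallest n where lengths n and n+1 give the same half-precision mean."""
    for n in range(1, max_n):
        a = mean_of_ones_then_zero(n)
        b = mean_of_ones_then_zero(n + 1)
        if to_half(a) == to_half(b):
            return n
    return None


n_collide = first_collision()
```

Past that point, anything downstream that sees only the rounded value cannot tell the two sequence lengths apart, which is the flavor of failure the counting experiments probe.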

Connection to my work

UAT-LITE is about salvaging information in attention at inference time. Barbero's collapse result suggests where in the network the information loss happens. Useful for designing better attention interventions.

What I'm trying next

Check if UAT-LITE attention modulation reduces representational collapse on counting tasks. Could be a nice qualitative experiment.

