⌨️ Open Source
Selected projects across LLM training, agentic AI, and biomedical machine learning.
Most of my open-source work lives on GitHub, where I publish training recipes, fine-tuning code, and research artifacts. Below are the ones I'd recommend looking at first.
🧠 LLM Training & Fine-tuning
4 projects
🐉
qwen-scratch-0.6B
A Qwen-style 0.6B transformer built from scratch in PyTorch: an end-to-end pretraining recipe covering the tokenizer, data pipeline, and training loop.
🔁
Continual-pre-training
Continual pretraining recipes for adapting base LLMs to new domains without catastrophic forgetting. Includes data mixing strategies and an evaluation pipeline.
💎
efficient-gemma3-finetuning
LoRA / QLoRA fine-tuning recipes for Gemma3, optimized for single-GPU and multi-GPU setups. Includes data formatting and inference scripts.
⤳
TAS
Temporal Attractor Steering: a retrieval-free, training-free inference-time framework that detects, localizes, and steers parametric temporal conflicts in open-weight LMs. Project page with the verified PTC benchmark and code.
🤖 Agentic AI & RAG
3 projects
🛡️
NEXUS
Structured runtime safety monitor for tool-using LLM agents. Deterministic rules + calibrated risk scoring under a four-class intervention policy (allow / block / confirm / revise). Project page with datasets and code.
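To make the four-class policy concrete, here is a minimal sketch of how deterministic rules and a risk score could map to an intervention. It is an illustration only, not NEXUS's actual API: `ToolCall`, `DENYLIST`, `decide`, and every threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CONFIRM = "confirm"
    REVISE = "revise"


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical deterministic rule: tools that are never permitted.
DENYLIST = {"delete_database"}


def decide(call: ToolCall, risk: float) -> Action:
    """Map a tool call and a risk score in [0, 1] to one of four actions.

    The thresholds below are illustrative assumptions, not values from NEXUS.
    """
    if call.tool in DENYLIST:   # deterministic rules take precedence
        return Action.BLOCK
    if risk >= 0.8:
        return Action.BLOCK
    if risk >= 0.5:
        return Action.CONFIRM   # pause and ask the user before proceeding
    if risk >= 0.2:
        return Action.REVISE    # ask the agent to rewrite the call
    return Action.ALLOW
```

Checking the rules before the score keeps hard constraints independent of how well the risk model is calibrated.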
🔍
RAG_Agentic_AI
Retrieval-augmented agentic AI system combining vector search with multi-step reasoning loops. MIT-licensed reference implementation.
📚
Multi-PDF-Chat-Agent
Chat with multiple PDFs at once via a RAG pipeline. Vector store, chunking strategies, and a clean conversational interface.
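As a rough illustration of one common chunking strategy, the sketch below splits text into fixed-size character windows with overlap. It is not taken from the repository: `chunk_text` and its default sizes are assumptions made for the example, and a real pipeline might split on tokens or sentences instead.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once a window has reached the end, so no trailing chunk is
    # made up entirely of text already covered by the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, at the cost of storing some text twice.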
🧬 Healthcare & Bio AI
3 projects
🏥
MiniHealthLM
A small domain-adapted language model for healthcare and clinical tasks. Pretraining + instruction tuning recipe on medical corpora.
🦴
Bone-Fracture-Detection
Deep learning pipelines for detecting fractures on X-ray imagery. Compares ResNet, DenseNet, and EfficientNet backbones on benchmark data.
🧪
RNA-Seq-MoML
Multi-omics machine learning pipeline for RNA-Seq data analysis. Companion code for biological reasoning research.
📚 Datasets & Benchmarks
4 datasets
🫧
nanobubbleeval
Benchmark for schema-constrained extraction from nanobubble and nanocarrier literature. 51,566 deduplicated records with 40 gold-annotated examples across 18 fields (including size, zeta potential, stability, loading efficiency, and release profile).
🎗️
CancerAbstracts
1,874 biomedical research abstracts labeled by cancer type (Lung, Thyroid, Colon, Generic). For text classification and biomedical NLP research.
🧬
BioDivergence-Silver-v1.0
11.9k biomedical claim pairs from scientific articles for NLI, claim verification, and contradiction detection. Includes labeled contradictions, evidence spans, and publication metadata with train/val/test splits.
⏳
ptc-benchmark
Temporal QA benchmark with 9,250 examples testing how LLMs handle knowledge conflicts and reasoning over time-varying facts. Wikidata-sourced with old/new entity values and their time periods.
🛠️ Other & Tutorials
3 projects
📊
Neural-network-visualizer
Interactive neural network visualizer for teaching forward passes, activations, and gradients. Useful for explainability and intro courses.
⚛️
Quantum-ML
Quantum machine learning experiments and notebooks comparing classical and quantum approaches on small benchmark tasks.
🎙️
deepspeech2
DeepSpeech2-based speech recognition implementation from earlier industry work. CNN + RNN architecture for end-to-end ASR.