Study Notes: Stanford CS336 Language Modeling from Scratch [15]
A hands-on guide to training a math reasoning model with GRPO on Lambda Cloud using 2×H100 GPUs, with practical implementation details along the way. The training run improves Qwen2.5-Math-1.5B accuracy from ~6% to ~25%.