Study Notes: Stanford CS336 Language Modeling from Scratch [12]
From MacBook to Cloud: A Practical Guide to Developing and Scaling LLM Training Code
When developing machine learning training pipelines, there’s often a disconnect between local development environments and production-scale cloud infrastructure. You might prototype on your laptop (say, a MacBook with Apple Silicon), only to discover that your code breaks on CUDA GPUs, or that patterns that worked locally don’t scale in the cloud.
In this note, I’ll share my workflow for developing Supervised Fine-Tuning (SFT) code on a MacBook with Apple Silicon, testing it locally, then seamlessly deploying to
Google Colab or multi-GPU cloud instances like Lambda Labs.
This workflow was developed while implementing SFT for Qwen2.5-Math-1.5B on the MATH dataset (for CS336 Assignment 5), but the principles apply broadly to any PyTorch-based training pipeline development.
Table of Contents
- From MacBook to Cloud: A Practical Guide to Developing and Scaling LLM Training Code
- Table of Contents
- The Challenge: Bridging Local and Cloud Development
- Part 1: Setting Up Local Development Environment
- Part 2: Writing Device-Agnostic Training Code
- Part 3: Local Testing and Validation
- Part 4: Packaging for Cloud Deployment
- Part 5: Deploying to Google Colab
- Part 6: Scaling to Multi-GPU with Accelerate
- Part 7: Practical Recommendations and Lessons Learned
- Conclusion
The Challenge: Bridging Local and Cloud Development
My typical ML development workflow faces a fundamental tension—I use a MacBook Pro with an M4 chip for personal side projects, which creates some tradeoffs:
| Environment | Pros | Cons |
|---|---|---|
| Local (MacBook) | Fast iteration, no cost, familiar tools | Limited memory, slower training, no CUDA (many GPU acceleration frameworks only support CUDA) |
| Cloud (Colab/Lambda) | Powerful GPUs, scalable, CUDA support | Setup overhead, costs money, less interactive |
The ideal workflow would let me:
- Develop locally with fast feedback loops
- Test easily before committing cloud resources
- Deploy seamlessly without rewriting code
- Scale horizontally when more compute is available
This note presents a battle-tested approach to achieving all four.
Part 1: Setting Up Local Development Environment
Why Apple Silicon for ML Development?
Beyond personal preference (I’ve been an Apple product fan since grad school), Apple Silicon Macs offer a genuinely compelling development environment:
- Unified Memory Architecture: 16–64GB RAM shared between CPU and GPU
- Metal Performance Shaders (MPS): PyTorch backend for GPU acceleration
- Power Efficiency: Extended battery life for portable development
- Native ARM: Fast Python and native tool execution
However, there are important limitations:
| Feature | CUDA (NVIDIA) | MPS (Apple Silicon) |
|---|---|---|
| Float16 Training | Stable with gradient scaling | Often causes NaN losses |
| BFloat16 | Full support (Ampere+) | Not supported |
| Multi-GPU | NCCL, NVLink | Single GPU only |
| Flash Attention | Available | Not available |
| Memory | Dedicated VRAM | Shared system RAM |
Key Insight: MPS (Metal Performance Shaders—Apple’s GPU-accelerated compute framework for macOS and iOS) is excellent for development and testing but usually requires float32 precision for numerical stability. I need to plan for this difference when writing device-agnostic code.
Project Structure and Package Management
I use uv for fast, reproducible Python package management. Here’s how I set up my local dev environment for CS336 Assignment 5.
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Project structure:
assignment5-alignment/
├── cs336_alignment/
│ ├── __init__.py
│ ├── sft.py # Main training code
│ └── prompts/
│ └── r1_zero.prompt # Prompt template
├── scripts/
│ ├── run_sft.py # Training entry point
│ ├── download_model.py # Model downloader
│ └── download_math.py # Data downloader
├── notebooks/
│ └── sft_training_colab.ipynb
├── pyproject.toml # Dependencies
└── uv.lock # Locked versions
pyproject.toml with optional CUDA dependencies:
[project]
name = "alignment"
requires-python = ">=3.11,<3.13"
dependencies = [
"accelerate>=1.5.2",
"torch",
"transformers>=4.50.0",
"tqdm>=4.67.1",
"matplotlib>=3.8.0",
]
[project.optional-dependencies]
cuda = [
"flash-attn==2.7.4.post1",
]
Local installation:
uv sync # Basic install (Mac/CPU)
uv sync --extra cuda # With CUDA extras (Linux with GPU)
Part 2: Writing Device-Agnostic Training Code
The key to seamless local-to-cloud transitions is writing code that adapts to available hardware without manual changes.
Handling Device Detection
def get_device(device_str: str = "auto") -> str:
"""Get the best available device for training."""
if device_str != "auto":
return device_str
if torch.cuda.is_available():
return "cuda"
elif torch.backends.mps.is_available():
return "mps"
else:
return "cpu"
Numerical Precision Considerations
This is where many developers encounter their first “works locally, fails on cloud” (or vice versa) bug:
def get_dtype_and_precision(device: str) -> tuple[torch.dtype, str]:
"""
Determine appropriate dtype and mixed precision setting.
Critical insight: MPS does NOT support float16 training reliably.
Using float16 on MPS often results in NaN losses due to lack of
proper mixed-precision support and gradient scaling.
"""
if device == "cuda":
# CUDA: Use bfloat16 if available (Ampere+), else float16
if torch.cuda.is_bf16_supported():
return torch.bfloat16, "bf16"
else:
return torch.float16, "fp16"
else:
# MPS and CPU: Use float32 for numerical stability
return torch.float32, "no"
Why This Matters:
| Device | Recommended Dtype | Reason |
|---|---|---|
| CUDA (Ampere+) | bfloat16 | Best balance of speed and stability |
| CUDA (older) | float16 | With gradient scaling |
| MPS | float32 | float16 may cause NaN losses |
| CPU | float32 | No mixed precision benefit |
I learned this the hard way when my training showed loss: nan on MPS even though the code itself was logically correct. The fix was simple once identified:
# Before (broken on MPS)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
# After (works everywhere)
dtype = torch.float32 if device in ["mps", "cpu"] else torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype)
Gradient Accumulation for Memory Efficiency
With limited memory on laptops (even 32GB unified memory), gradient accumulation is essential:
# Effective batch size = batch_size * gradient_accumulation_steps
# Example: batch_size=1, grad_accum=8 -> effective batch of 8
for step, batch in enumerate(dataloader):
# Forward pass
loss = compute_loss(model, batch)
# Scale loss for gradient accumulation
scaled_loss = loss / gradient_accumulation_steps
scaled_loss.backward()
# Only update weights every N steps
if (step + 1) % gradient_accumulation_steps == 0:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
Memory Scaling Recommendations (CS336 Assignment 5: SFT on Qwen2.5-Math-1.5B with the MATH dataset)
| Device | Chip Generations | Typical Memory | Found In | batch_size | gradient_accumulation_steps | Effective Batch |
|---|---|---|---|---|---|---|
| Apple M-series (base) | M1, M2, M3, M4 | 8–16GB | MacBook Air, 13” MacBook Pro | 1 | 16 | 16 |
| Apple M-series Pro | M1, M2, M3, M4 | 18–48GB | 14”/16” MacBook Pro | 2–4 | 4–8 | 16 |
| Apple M-series Max | M1, M2, M3, M4 | 36–128GB | 14”/16” MacBook Pro (high-end) | 4–8 | 2–4 | 16 |
| Apple M-series Ultra | M1, M2 | 64–192GB | Mac Studio, Mac Pro | 8–16 | 1–2 | 16 |
| NVIDIA A100 (40GB) | — | 40GB | Cloud (Lambda, GCP, AWS) | 8 | 2 | 16 |
| NVIDIA A100 (80GB) | — | 80GB | Cloud (Lambda, GCP, AWS) | 16 | 1 | 16 |
Effective batch = batch_size × gradient_accumulation_steps. Larger batch sizes reduce training time but require more memory.
Key insights:
- Memory constrains batch size, not effective batch size. When GPU memory is limited, reduce batch_size and increase gradient_accumulation_steps to maintain the same effective batch size (see the sketch after this list). The model sees identical gradients either way—accumulation just trades memory for time.
- Gradient accumulation is a memory-saving trick. Instead of computing gradients on 16 samples at once (which requires storing all intermediate activations), you process 1 sample 16 times, accumulating gradients before each optimizer step. This uses ~1/16th the memory at the cost of ~16× more forward/backward passes.
- Effective batch size should stay constant across devices. Notice that all rows target an effective batch of 16. This ensures consistent training dynamics regardless of hardware—important for reproducibility when moving between local development and cloud training.
- Diminishing returns on large batch sizes. Beyond a certain point, larger batches don’t proportionally speed up training: memory bandwidth limits mean that once a batch fully utilizes the GPU, making it bigger only creates a queue the GPU can’t process any faster, and larger batches also reduce gradient noise, which can actually help optimization.
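To make the first point concrete, here is a minimal sketch (a hypothetical helper, not part of the assignment code) that keeps the effective batch fixed while letting the per-device batch size shrink to whatever fits in memory:
def accumulation_steps_for(target_effective_batch: int, per_device_batch_size: int) -> int:
    """Pick gradient_accumulation_steps so that
    per_device_batch_size * gradient_accumulation_steps == target_effective_batch."""
    if target_effective_batch % per_device_batch_size != 0:
        raise ValueError("per-device batch size must divide the target effective batch")
    return target_effective_batch // per_device_batch_size

# Same effective batch of 16 on a memory-limited laptop and on a cloud GPU
assert accumulation_steps_for(16, per_device_batch_size=1) == 16   # base M-series MacBook
assert accumulation_steps_for(16, per_device_batch_size=8) == 2    # NVIDIA A100 40GB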
Part 3: Local Testing and Validation
Before deploying to cloud, thorough local testing saves time and money.
Quick Sanity Checks
# Test with minimal samples to verify pipeline works
uv run python scripts/run_sft.py \
--model-name-or-path models/qwen2.5-math-1.5b \
--train-data-path data/math/train.jsonl \
--output-dir outputs/sft_test \
--num-samples 10 \
--num-epochs 1 \
--batch-size 1 \
--gradient-accumulation-steps 2
What to verify (a runnable smoke test follows this list):
- Model loads without errors
- Data pipeline produces valid batches
- Loss decreases (not NaN or constant)
- Checkpoints save correctly
- Model can be reloaded from checkpoint
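The quick run above exercises the whole script; an even lighter check is to take a few optimizer steps on a single repeated batch and confirm the loss is finite and decreasing. A minimal sketch (assumes the model has already been downloaded to models/qwen2.5-math-1.5b, matching the repo layout above):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def smoke_test(model_path: str = "models/qwen2.5-math-1.5b", steps: int = 5):
    device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # A tiny, repeated batch: overfitting it should drive the loss down quickly
    batch = tokenizer(["Problem: 1+1=? Answer: 2"] * 2, return_tensors="pt").to(device)
    losses = []
    for _ in range(steps):
        out = model(**batch, labels=batch["input_ids"])  # causal LM loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        losses.append(out.loss.item())

    assert all(l == l for l in losses), "NaN loss detected"   # NaN != NaN
    assert losses[-1] < losses[0], "loss did not decrease"
    print("smoke test passed:", [round(l, 3) for l in losses])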
Inference Engine: Local vs Cloud
A key challenge when developing on Apple Silicon is that vLLM—the go-to inference engine for fast LLM serving—requires CUDA and doesn’t run on Macs. This means I need two inference backends during the initial development phase:
| Environment | Inference Backend | Why |
|---|---|---|
| Local (MPS) | HuggingFace Transformers | Pure PyTorch, runs anywhere |
| Cloud (CUDA) | vLLM | Optimized kernels, PagedAttention, 10–20× faster |
My approach: Write a simple abstraction layer that switches backends based on the available hardware:
def get_inference_backend(model_path: str, device: str):
"""Return appropriate inference backend for the current environment."""
if device == "cuda" and is_vllm_available():
from vllm import LLM
return VLLMBackend(LLM(model=model_path))
else:
# Fallback to HuggingFace for MPS/CPU
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
return TransformersBackend(model, tokenizer)
What this enables:
- Local development: Test generation logic, prompt templates, and output parsing using the Transformers backend on my Mac
- Cloud deployment: Automatically switch to vLLM for fast, batched inference without changing my evaluation code
Trade-off to keep in mind: local inference is much slower than in the cloud. For local testing, I use small sample sizes (10–50 examples) to validate correctness before moving full evaluations to the cloud.
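The VLLMBackend and TransformersBackend wrappers referenced above aren’t shown in the snippet; a minimal sketch of what they might look like, assuming both expose the same generate() interface (class and method names here are illustrative, not from the assignment code):
class TransformersBackend:
    """Plain PyTorch generation; works on CUDA, MPS, and CPU."""
    def __init__(self, model, tokenizer):
        self.model, self.tokenizer = model, tokenizer

    def generate(self, prompts: list[str], max_new_tokens: int = 256) -> list[str]:
        # Assumes the tokenizer defines a pad token (set pad_token = eos_token if not)
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

class VLLMBackend:
    """vLLM generation; CUDA only."""
    def __init__(self, llm):
        self.llm = llm

    def generate(self, prompts: list[str], max_new_tokens: int = 256) -> list[str]:
        from vllm import SamplingParams
        params = SamplingParams(max_tokens=max_new_tokens)
        return [out.outputs[0].text for out in self.llm.generate(prompts, params)]
With this in place, the evaluation code only ever calls backend.generate(prompts) and never needs to know which engine is underneath.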
Verifying Gradient Accumulation
A common bug is incorrect gradient accumulation scaling. Here’s a verification approach:
import torch
import torch.nn as nn

def verify_gradient_accumulation():
    """
    Verify that accumulated gradients match a single large batch.
    The gradients should be identical (within floating point tolerance)
    whether we:
    1. Process 8 samples in one batch, or
    2. Process 1 sample 8 times with gradient accumulation,
    provided both methods see the same 8 samples and the accumulated
    loss is scaled by 1/8.
    """
    torch.manual_seed(0)
    # A tiny stand-in model keeps the check fast and self-contained
    model_single = nn.Linear(16, 1)
    model_accum = nn.Linear(16, 1)
    model_accum.load_state_dict(model_single.state_dict())  # copy weights

    # Use the same data for both methods (mean-reduced loss)
    inputs, targets = torch.randn(8, 16), torch.randn(8, 1)
    loss_fn = nn.MSELoss()

    # Method 1: single large batch
    loss_fn(model_single(inputs), targets).backward()
    grad_single = torch.cat([p.grad.flatten() for p in model_single.parameters()])

    # Method 2: accumulate gradients over 8 single-sample batches
    for i in range(8):
        loss = loss_fn(model_accum(inputs[i:i + 1]), targets[i:i + 1]) / 8  # Scale!
        loss.backward()  # gradients accumulate in .grad
    grad_accum = torch.cat([p.grad.flatten() for p in model_accum.parameters()])

    # Verify they match
    assert torch.allclose(grad_single, grad_accum, rtol=1e-4, atol=1e-6)
Part 4: Packaging for Cloud Deployment
Repository Structure
I push my code to GitHub for easy cloud access, for example:
git add cs336_alignment/ scripts/ notebooks/ pyproject.toml uv.lock
git commit -m "Add SFT training pipeline"
git push origin main
Dependency Management with uv
The uv.lock file ensures reproducible environments:
# Generate lock file locally
uv lock
# On cloud, install exact versions
uv sync # Reads uv.lock automatically
Why uv over pip/conda/poetry?
| Aspect | pip | conda | Poetry | uv |
|---|---|---|---|---|
| Speed | Moderate | Slow | Slow | Very fast (Rust-based) |
| Lock file | ❌ (requires pip-tools) | ❌ (manual export) | ✅ | ✅ |
| PyTorch/CUDA handling | Manual | Good | Finicky | Smooth |
| Mac → Linux portability | Poor | Poor | Good | Excellent |
| Dependency resolution | Basic | Solver can be slow | Good but slow | Fast and reliable |
Why this matters for ML workflows:
- Speed: ML projects have heavy dependencies (PyTorch, Transformers, flash-attn). Poetry can take 30–60s to resolve; uv takes 1–5s.
- PyTorch complexity: PyTorch has separate wheels for CPU, CUDA 11.8, CUDA 12.1, etc. Poetry often requires manual configuration with custom sources. uv handles this automatically.
- Cross-platform: I am developing on Mac (ARM) and deploying to Linux (x86 + CUDA). uv’s lock file captures platform-specific metadata, so uv sync installs the correct versions on each platform without separate environment files.
When you might still choose Poetry:
- Publishing packages to PyPI (Poetry has built-in support)
- Your team already uses it and has established workflows
- You need Poetry’s plugin ecosystem
For ML development workflows like this one, uv’s speed and PyTorch handling are significant wins.
Part 5: Deploying to Google Colab
Single GPU Training on Google Colab
Google Colab provides easy access to cloud GPUs with minimal setup. With your packaged repo, you can create a notebook with the following cells to run training on Colab:
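It’s worth first confirming the runtime actually has a GPU attached; an optional check cell (standard commands, nothing project-specific; Colab ships with PyTorch preinstalled, so this works before any project install):
# Cell 0 (optional): confirm a GPU is attached to the runtime
!nvidia-smi
import torch
print("CUDA available:", torch.cuda.is_available())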
# Cell 1: Clone and setup
!git clone https://github.com/YOUR_USERNAME/assignment5-alignment.git
%cd assignment5-alignment
!git checkout main
# Cell 2: Install uv and dependencies
!curl -LsSf https://astral.sh/uv/install.sh | sh
import os
os.environ['PATH'] = f"{os.path.expanduser('~')}/.local/bin:{os.environ['PATH']}"
!uv sync --extra cuda
# Cell 3: Download model and data
!uv run python scripts/download_model.py --model-name Qwen/Qwen2.5-Math-1.5B
!uv run python scripts/download_math.py
# Cell 4: Run training
!uv run python scripts/run_sft.py \
--model-name-or-path models/qwen2.5-math-1.5b \
--train-data-path data/math/train.jsonl \
--output-dir outputs/sft_model \
--batch-size 2 \
--gradient-accumulation-steps 8 \
--device cuda
Colab-Specific Considerations
| Aspect | Recommendation |
|---|---|
| Runtime selection | Runtime → Change runtime type → Select GPU (T4 for free tier, A100 for Pro+) |
| Session timeout | Save checkpoints every 1–2 epochs; free tier can preempt without warning |
| Persistence | Mount Google Drive for outputs to survive session resets |
| Memory limits | T4 has 16GB VRAM—use batch_size=2 with gradient accumulation |
| Background execution | Pro+ only—training continues after closing browser |
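To verify a run actually fits in the T4’s 16GB (or whichever GPU you’re assigned), PyTorch’s built-in memory counters are enough; check them after a few training steps:
import torch

# Current and peak GPU memory used by tensors (in GB)
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")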
Google Drive mounting:
from google.colab import drive
drive.mount('/content/drive')
# Save outputs to Drive
output_dir = '/content/drive/MyDrive/sft_outputs'
Saving to Google Drive:
from google.colab import drive
drive.mount('/content/drive')
!cp -r outputs/sft_model/final /content/drive/MyDrive/sft_model
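To guard against free-tier preemption, checkpoints can also be written to Drive periodically during training. A minimal sketch for inside the training loop (save_every is an illustrative name; step, model, and tokenizer come from your loop):
import os

save_every = 500  # steps; tune to your dataset and epoch length
drive_dir = '/content/drive/MyDrive/sft_outputs'

if (step + 1) % save_every == 0:
    ckpt_dir = os.path.join(drive_dir, f"checkpoint-{step + 1}")
    model.save_pretrained(ckpt_dir)        # HF models support save_pretrained
    tokenizer.save_pretrained(ckpt_dir)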
Part 6: Scaling to Multi-GPU with Accelerate
Why HuggingFace Accelerate
Google Colab typically provides only 1 GPU. For multi-GPU training (Lambda Labs, AWS, etc.), we can use HuggingFace Accelerate:
| Feature | Manual DDP | Accelerate |
|---|---|---|
| Code changes | Significant | Minimal |
| Device placement | Manual | Automatic |
| Gradient sync | Manual | Automatic |
| Mixed precision | Manual setup | One flag |
| Single/Multi GPU | Different code paths | Same code |
Code Changes for Multi-GPU Support
The key changes to support multi-GPU:
import torch
from accelerate import Accelerator

def train_sft(config):
    # Initialize Accelerator (bf16 mixed precision only makes sense on CUDA)
    accelerator = Accelerator(
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        mixed_precision="bf16" if torch.cuda.is_available() else "no",
    )
    # ... build model, optimizer, dataloader, scheduler exactly as before ...

    # Prepare model, optimizer, dataloader, scheduler for (possibly distributed) training
    model, optimizer, dataloader, scheduler = accelerator.prepare(
        model, optimizer, dataloader, scheduler
    )
    for batch in dataloader:
        # Use accelerator's gradient accumulation context
        with accelerator.accumulate(model):
            loss = compute_loss(model, batch)
            accelerator.backward(loss)  # Instead of loss.backward()
            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(model.parameters(), max_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

    # Save only on main process
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model)
        save_model(unwrapped_model, output_dir)
Key Accelerate Patterns:
- accelerator.prepare(): Wraps objects for distributed training
- accelerator.accumulate(): Handles gradient accumulation correctly
- accelerator.backward(): Syncs gradients across devices
- accelerator.sync_gradients: True when accumulation cycle completes
- accelerator.is_main_process: Only one process logs/saves
Lambda Labs Deployment
# SSH into Lambda instance
ssh ubuntu@your-instance-ip
# Setup
git clone https://github.com/YOUR_USERNAME/assignment5-alignment.git
cd assignment5-alignment
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.local/bin/env
uv sync --extra cuda
# Download model and data
uv run python scripts/download_model.py
uv run python scripts/download_math.py
# Multi-GPU training (auto-detects available GPUs)
uv run accelerate launch --multi_gpu scripts/run_sft.py \
--model-name-or-path models/qwen2.5-math-1.5b \
--batch-size 4 \
--gradient-accumulation-steps 2
Scaling Guide (with multiple GPUs, effective batch = num_GPUs × per-GPU batch_size × grad_accum):
| GPUs | batch_size | grad_accum | Effective Batch | Command |
|---|---|---|---|---|
| 1 | 4 | 4 | 16 | uv run python scripts/run_sft.py |
| 2 | 4 | 2 | 16 | accelerate launch --num_processes 2 |
| 4 | 4 | 1 | 16 | accelerate launch --num_processes 4 |
| 8 | 4 | 1 | 32 | accelerate launch --num_processes 8 |
Part 7: Practical Recommendations and Lessons Learned
Development Workflow Summary
+------------------------------------------------------------------+
| LOCAL DEVELOPMENT (Mac) |
+------------------------------------------------------------------+
| 1. Write code with device-agnostic patterns |
| 2. Test with small samples (--num-samples 10) |
| 3. Verify loss decreases, no NaN |
| 4. Run unit tests (pytest) |
| 5. Commit and push to GitHub |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| CLOUD VALIDATION (Colab) |
+------------------------------------------------------------------+
| 1. Clone repo, install dependencies |
| 2. Quick test with 100 samples |
| 3. Verify CUDA path works correctly |
| 4. Check memory usage fits GPU |
| 5. Save checkpoint to Google Drive |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| PRODUCTION TRAINING (Lambda/Cloud) |
+------------------------------------------------------------------+
| 1. Use accelerate launch for multi-GPU |
| 2. Full dataset training |
| 3. Monitor with logging/wandb |
| 4. Save final model and metrics |
+------------------------------------------------------------------+
Common Pitfalls and Solutions
| Pitfall | Symptom | Solution |
|---|---|---|
| Float16 on MPS | loss: nan | Use float32 on MPS |
| Wrong grad accumulation | Gradients don’t match | Divide loss by accumulation steps |
| Missing is_main_process check | Duplicate logs/saves | Guard with accelerator.is_main_process |
| Hardcoded device | Crashes on different hardware | Use get_device("auto") |
| No checkpoint saving | Lost progress on timeout | Save every N steps |
Performance Comparison
From my experiments with Qwen2.5-Math-1.5B on MATH dataset:
| Environment | Device | batch_size x grad_accum | Time per 100 steps |
|---|---|---|---|
| MacBook M2 Pro | MPS | 1 x 8 | ~45 min |
| Colab Free | T4 | 2 x 8 | ~12 min |
| Colab Pro | A100 | 8 x 2 | ~3 min |
| Lambda (4x A100) | 4x A100 | 4 x 1 (per GPU) | ~1 min |
Conclusion
Developing ML training code that works seamlessly from a MacBook to multi-GPU cloud instances requires intentional design:
- Device-agnostic code: Abstract device selection and dtype handling
- Numerical stability: Use float32 on MPS, mixed precision on CUDA
- Memory efficiency: Implement gradient accumulation from the start
- Reproducible environments: Use uv with lock files
- Distributed-ready: Integrate Accelerate for painless multi-GPU scaling
The workflow I’ve shared—develop locally on MacBook, validate on Colab, scale on cloud with distributed training—provides fast iteration during development while enabling production-scale training when needed. The key insight is that the code should adapt to the hardware, not the other way around.
I hope this empowers you to develop confidently on your laptop, knowing that deploying to powerful cloud GPUs is a matter of changing a single command—not rewriting your training pipeline.