Demystifying GPT-2’s Pre-Tokenization: How One Regex Pattern Handles the World’s Languages

While working on Assignment 1 of Stanford’s CS336: Language Modeling from Scratch, I came across a deceptively simple — yet remarkably powerful — regex pattern used in the pre-tokenization stage of the BPE algorithm.

I thought it would be worthwhile to share my notes and walk through how this single pattern can handle text from multiple languages, scripts, and symbol sets with precision.

You can find my full BPE assignment implementation here: https://github.com/bearbearyu1223/assignment1-basics.


🔧 How to Run the BPE Training Process

  1. Clone the repository
    git clone https://github.com/bearbearyu1223/assignment1-basics.git
    cd assignment1-basics
    
  2. Set up the local development environment
    Follow the instructions in developer_guide.md.

  3. Run BPE training
    uv run cs336_basics/train_bpe_example.py
    

    This will train a BPE tokenizer on the TinyStoriesV2-GPT4-train.txt dataset, with a vocabulary size of 10,000 and the special token "<|endoftext|>".
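
    For reference, here is a minimal sketch of what such an entry point might look like. The train_bpe import path, signature, and data path are assumptions for illustration; check the actual script in the repo:

    from cs336_basics.train_bpe import train_bpe  # hypothetical import path

    # Train a byte-level BPE tokenizer on TinyStories with the settings above.
    vocab, merges = train_bpe(
        input_path="data/TinyStoriesV2-GPT4-train.txt",  # assumed location
        vocab_size=10_000,
        special_tokens=["<|endoftext|>"],
    )
    print(f"{len(vocab)} vocab entries, {len(merges)} merges learned")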


🧪 How to Test the Tokenizer

Run:

uv run pytest tests/test_train_bpe.py

This will validate the tokenizer’s functionality and ensure the pre-tokenization regex behaves as expected.

📜 The GPT-2 Split Pattern

import regex

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

This single (but mighty) regex is responsible for splitting text into meaningful segments (words, numbers, punctuation, symbols, whitespace) in a way that is consistent across languages and scripts. Note that it relies on the third-party regex module rather than Python's built-in re, since re does not support Unicode property classes like \p{L} and \p{N}.
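
A quick sanity check, reusing the import and pattern above (expected output shown in the comment):

# findall returns the non-overlapping pre-tokens, left to right.
tokens = regex.findall(GPT2_SPLIT_PATTERN, "We've paid €50 already!")
print(tokens)
# ['We', "'ve", ' paid', ' €', '50', ' already', '!']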


🔍 Pattern Breakdown

1. Contractions

'(?:[sdmt]|ll|ve|re)

Matches common English contractions starting with an apostrophe:
's, 'd, 'm, 't, 'll, 've, 're

Examples:

  • "don't"["don", "'t"]
  • "we're"["we", "'re"]

2. Letters (Any Language)

 ?\p{L}+

Matches letters from any Unicode language (with optional leading space).

  • \p{L} = Unicode “Letter” category
  • Covers: English, Chinese, Arabic, accented characters, and more.

Examples:

  • "hello world"["hello", " world"]
  • "café 北京"["café", " 北京"]

3. Numbers (Any Script)

 ?\p{N}+

Matches numbers from any writing system (with optional leading space).

  • \p{N} = Unicode “Number” category
  • Covers: Arabic numerals (0–9), Roman numerals (Ⅰ, Ⅱ, Ⅲ), Arabic-Indic (٠١٢), etc.

Examples:

  • "I have 5 items"["I", " have", " 5", " items"]
  • "Ⅲ winners"["Ⅲ", " winners"]

4. Punctuation / Symbols

 ?[^\s\p{L}\p{N}]+

Matches punctuation or symbols (with optional leading space).

  • [^\s\p{L}\p{N}] = NOT whitespace, NOT letters, NOT numbers
  • Captures: !@#$%^&*()_+-=[]{}|;:'",./<>?

Examples:

  • "Wow!!!"["Wow", "!!!"]
  • " $100"[" $", "100"]

5. Trailing Whitespace

\s+(?!\S)

Matches a run of whitespace that is not immediately followed by a non-whitespace character.
Because the match can backtrack, a run of spaces before a word gives up its final space, which then attaches to the following word via the ` ?` prefixes above; trailing whitespace at the end of the text is preserved as its own token.
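
This backtracking is the subtle part; in a run of spaces before a word, the branch leaves exactly one space for the following word token:

print(regex.findall(GPT2_SPLIT_PATTERN, "a   b  "))  # ['a', '  ', ' b', '  ']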


6. General Whitespace

\s+

Matches any remaining whitespace that the previous branch could not capture.
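
This fallback fires, for example, on a single newline or tab directly before a visible character, since the ` ?` prefixes only absorb a literal space, not other whitespace:

print(regex.findall(GPT2_SPLIT_PATTERN, "a\nb"))  # ['a', '\n', 'b']
print(regex.findall(GPT2_SPLIT_PATTERN, "a\tb"))  # ['a', '\t', 'b']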


🛠 Testing the Pattern

Here’s a helper function to test how this regex splits different inputs.

def test_regex(text, description=""):
    """Test the regex pattern and display results clearly"""
    print(f"\n{'='*60}")
    print(f"TEST: {description}")
    print(f"INPUT: '{text}'")
    print(f"{'='*60}")

    matches = regex.findall(GPT2_SPLIT_PATTERN, text)

    print(f"TOKENS ({len(matches)}):")
    for i, token in enumerate(matches, 1):
        print(f"  {i:2d}: {repr(token)}") 

    reconstructed = ''.join(matches)
    print(f"\nRECONSTRUCTION CHECK: {'✓ PASS' if reconstructed == text else '✗ FAIL'}")
    return matches

🧪 Real-World Test Cases

Below are diverse examples — from contractions to Unicode scripts, punctuation to code.

test_cases = [
    ("I can't believe it's working!", "Basic contractions"),
    ("You're right, they'll see we've done it.", "Multiple contractions"),
    ("Hello 世界! Café français 🌍", "Unicode letters and emoji"),
    ("I have 5 cats, ٧ dogs, and Ⅲ birds.", "Various number systems"),
    ("Wait... What?!? $100.50 (seriously)!!!", "Complex punctuation"),
    ("  Multiple   spaces   everywhere  ", "Multiple spaces"),
    ("She's got $1,000.50 in café № ٧... Amazing!!! 🚀", "Complex mixed text"),
    ("'s'd'm't'll've're", "Contraction edge cases"),
    ("!@#$%^&*()_+-=[]{}|;:",./<>?", "Pure punctuation"),
    ("   \t\n  ", "Pure whitespace"),
    ("", "Empty string"),
    ("a 1 ! '", "Single characters"),
    ("我有3只猫,很可爱!", "Chinese with numbers"),
    ("مرحبا بالعالم ١٢٣", "Arabic text with numbers"),
    ("def hello_world(): return 'Hello, World!'", "Code-like text"),
    ("Visit https://example.com or email test@domain.co.uk", "URLs and emails"),
]

for text, description in test_cases:
    test_regex(text, description)

Running these cases produces token lists that perfectly reconstruct the original text; see the results below:

    ============================================================
    TEST: Basic contractions
    INPUT: 'I can't believe it's working!'
    ============================================================
    TOKENS (8):
       1: 'I'
       2: ' can'
       3: "'t"
       4: ' believe'
       5: ' it'
       6: "'s"
       7: ' working'
       8: '!'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Multiple contractions
    INPUT: 'You're right, they'll see we've done it.'
    ============================================================
    TOKENS (12):
       1: 'You'
       2: "'re"
       3: ' right'
       4: ','
       5: ' they'
       6: "'ll"
       7: ' see'
       8: ' we'
       9: "'ve"
      10: ' done'
      11: ' it'
      12: '.'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Unicode letters and emoji
    INPUT: 'Hello 世界! Café français 🌍'
    ============================================================
    TOKENS (6):
       1: 'Hello'
       2: ' 世界'
       3: '!'
       4: ' Café'
       5: ' français'
       6: ' 🌍'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Various number systems
    INPUT: 'I have 5 cats, ٧ dogs, and Ⅲ birds.'
    ============================================================
    TOKENS (12):
       1: 'I'
       2: ' have'
       3: ' 5'
       4: ' cats'
       5: ','
       6: ' ٧'
       7: ' dogs'
       8: ','
       9: ' and'
      10: ' Ⅲ'
      11: ' birds'
      12: '.'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Complex punctuation
    INPUT: 'Wait... What?!? $100.50 (seriously)!!!'
    ============================================================
    TOKENS (11):
       1: 'Wait'
       2: '...'
       3: ' What'
       4: '?!?'
       5: ' $'
       6: '100'
       7: '.'
       8: '50'
       9: ' ('
      10: 'seriously'
      11: ')!!!'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Multiple spaces
    INPUT: '  Multiple   spaces   everywhere  '
    ============================================================
    TOKENS (7):
       1: ' '
       2: ' Multiple'
       3: '  '
       4: ' spaces'
       5: '  '
       6: ' everywhere'
       7: '  '
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Complex mixed text
    INPUT: 'She's got $1,000.50 in café № ٧... Amazing!!! 🚀'
    ============================================================
    TOKENS (17):
       1: 'She'
       2: "'s"
       3: ' got'
       4: ' $'
       5: '1'
       6: ','
       7: '000'
       8: '.'
       9: '50'
      10: ' in'
      11: ' café'
      12: ' №'
      13: ' ٧'
      14: '...'
      15: ' Amazing'
      16: '!!!'
      17: ' 🚀'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Contraction edge cases
    INPUT: ''s'd'm't'll've're'
    ============================================================
    TOKENS (7):
       1: "'s"
       2: "'d"
       3: "'m"
       4: "'t"
       5: "'ll"
       6: "'ve"
       7: "'re"
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Pure punctuation
    INPUT: '!@#$%^&*()_+-=[]{}|;:",./<>?'
    ============================================================
    TOKENS (1):
       1: '!@#$%^&*()_+-=[]{}|;:",./<>?'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Pure whitespace
    INPUT: '   	
      '
    ============================================================
    TOKENS (1):
       1: '   \t\n  '
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Empty string
    INPUT: ''
    ============================================================
    TOKENS (0):
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Single characters
    INPUT: 'a 1 ! ''
    ============================================================
    TOKENS (4):
       1: 'a'
       2: ' 1'
       3: ' !'
       4: " '"
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Chinese with numbers
    INPUT: '我有3只猫,很可爱!'
    ============================================================
    TOKENS (6):
       1: '我有'
       2: '3'
       3: '只猫'
       4: ','
       5: '很可爱'
       6: '!'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Arabic text with numbers
    INPUT: 'مرحبا بالعالم ١٢٣'
    ============================================================
    TOKENS (3):
       1: 'مرحبا'
       2: ' بالعالم'
       3: ' ١٢٣'
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: Code-like text
    INPUT: 'def hello_world(): return 'Hello, World!''
    ============================================================
    TOKENS (11):
       1: 'def'
       2: ' hello'
       3: '_'
       4: 'world'
       5: '():'
       6: ' return'
       7: " '"
       8: 'Hello'
       9: ','
      10: ' World'
      11: "!'"
    
    RECONSTRUCTION CHECK: ✓ PASS
    
    ============================================================
    TEST: URLs and emails
    INPUT: 'Visit https://example.com or email test@domain.co.uk'
    ============================================================
    TOKENS (15):
       1: 'Visit'
       2: ' https'
       3: '://'
       4: 'example'
       5: '.'
       6: 'com'
       7: ' or'
       8: ' email'
       9: ' test'
      10: '@'
      11: 'domain'
      12: '.'
      13: 'co'
      14: '.'
      15: 'uk'
    
    RECONSTRUCTION CHECK: ✓ PASS

💡 Why This Matters

BPE pre-tokenization sets the stage for the tokenizer’s merge rules: merges are learned and applied only within pre-tokens, never across their boundaries. A well-designed split pattern ensures:

  • Language independence — Works with English, Arabic, Chinese, emoji, etc.
  • Symbol awareness — Keeps punctuation and symbols intact.
  • Whitespace fidelity — Preserves exact spacing for reversible tokenization.
  • Downstream accuracy — Reduces surprises during model training or inference.

🚀 Takeaways

  • This regex is a core building block of GPT-2’s tokenization process.
  • It’s language-agnostic, Unicode-friendly, and precise in splitting.
  • Understanding it helps when building custom tokenizers or adapting GPT-2 BPE to new domains.

If you’re working with LLMs, tokenization, or multilingual NLP, knowing the details behind this pattern will help you debug, customize, and optimize your preprocessing pipeline.