
Saturday, 19 July 2025


How ChatGPT Works – A Deep Technical Dive

🌟 INTRODUCTION: The Magic Behind the Curtain

Have you ever asked ChatGPT something — like "Summarize this news article," or "Explain AI like I'm 10" — and wondered, how is this even possible? Let's walk through how ChatGPT truly works — with real examples, visual metaphors, and detailed technical explanations.


🧠 PART 1: ChatGPT Is a Probability Machine

ChatGPT doesn't understand language the way humans do. It generates text by predicting what comes next, one token at a time.

Example:

You type: "The Eiffel Tower is in" —

ChatGPT looks at all the training data it's seen — books, websites, conversations — and estimates what word is most likely to come next. It may think:

  • "Paris" → 85% probability
  • "France" → 10%
  • "Europe" → 4%
  • "a movie" → 1%

With greedy decoding, the highest-probability token wins, so it outputs "Paris." (In practice, ChatGPT usually samples from this distribution, which is why the same prompt can give slightly different answers.)

This process continues:

  • "The Eiffel Tower is in Paris" → ✅
  • Then it predicts the next token again: maybe a period (.), or "and," or another phrase — depending on context.

🔢 Technically, the model learns a probability distribution over sequences of tokens. At each step it estimates P(next token | all previous tokens), i.e. P(xₜ | x₁, …, xₜ₋₁), picks or samples a token from that distribution, appends it, and repeats.

This is called auto-regressive generation — one token at a time, conditioned on all the tokens before it.
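To make that concrete, here is a tiny Python sketch of the choice, using the made-up probabilities from the Eiffel Tower example above (a real model scores tens of thousands of candidate tokens):

    import random

    # Toy next-token distribution for the prompt "The Eiffel Tower is in".
    # These probabilities are illustrative, not real model outputs.
    next_token_probs = {
        " Paris": 0.85,
        " France": 0.10,
        " Europe": 0.04,
        " a": 0.01,
    }

    def greedy_pick(probs):
        """Greedy decoding: take the single most likely token."""
        return max(probs, key=probs.get)

    def sample_pick(probs):
        """Sampling: draw a token in proportion to its probability."""
        tokens, weights = zip(*probs.items())
        return random.choices(tokens, weights=weights, k=1)[0]

    prompt = "The Eiffel Tower is in"
    print(prompt + greedy_pick(next_token_probs))   # "The Eiffel Tower is in Paris"
    print(prompt + sample_pick(next_token_probs))   # usually " Paris", occasionally something else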


🔡 PART 2: What's a Token?

Tokens are chunks of text, usually pieces of words rather than whole words or single characters. For example:

  • "ChatGPT is amazing" → ["Chat", "GPT", " is", " amazing"]

Each input and output is broken into tokens. GPT doesn't generate entire sentences at once — just one token at a time.

🧠 Why tokens? They strike a balance: fewer tokens than characters, more flexible than full words.
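To see real tokenization, you can use OpenAI's open-source tiktoken library (pip install tiktoken). The exact splits depend on the tokenizer version, so they may differ slightly from the example above:

    import tiktoken

    # Tokenizer used by GPT-3.5/GPT-4 era models.
    enc = tiktoken.get_encoding("cl100k_base")

    ids = enc.encode("ChatGPT is amazing")
    pieces = [enc.decode([i]) for i in ids]

    print(ids)      # a short list of integer token IDs
    print(pieces)   # e.g. ['Chat', 'GPT', ' is', ' amazing']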

📏 Context window: ChatGPT remembers a fixed number of tokens:

  • GPT-3.5: ~4,096
  • GPT-4: ~8,000–32,000 (depends on variant)

Once you go beyond the context window, it forgets the earlier tokens — like a rolling memory.
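A rough sketch of that rolling memory, assuming a simple keep-the-most-recent-tokens policy (real chat systems also reserve room for the reply and may summarize older turns instead of dropping them):

    def trim_to_context(token_ids, max_tokens=4096):
        """Keep only the most recent tokens that still fit in the context window."""
        return token_ids[-max_tokens:]

    history = list(range(10_000))          # pretend these are 10,000 token IDs
    context = trim_to_context(history)     # only the last 4,096 survive
    print(len(context))                    # 4096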


🧰 PART 3: What Powers It Underneath

ChatGPT runs on a type of deep neural network called a Transformer. Introduced in the 2017 paper "Attention Is All You Need," it revolutionized AI.

🧠 1. Embeddings: Giving Meaning to Words

What it is: Each word (token) is converted into a vector — a long list of numbers that captures meaning.

Analogy: Think of this like assigning GPS coordinates to every word. Similar words (like "Paris" and "London") will end up close to each other in this multi-dimensional map.

Example:

  • "Paris" → [0.25, -0.11, ..., 0.87] (Vector of 768 or 2048+ numbers)

This helps the model "understand" that Paris is a place, just like New York or Tokyo.
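A toy illustration of the idea, using made-up 4-dimensional vectors and cosine similarity as the notion of "closeness" on that map (real embeddings are learned and have hundreds or thousands of dimensions):

    import numpy as np

    # Tiny made-up "embeddings" purely for illustration.
    embeddings = {
        "Paris":  np.array([0.9, 0.1, 0.8, 0.0]),
        "London": np.array([0.8, 0.2, 0.7, 0.1]),
        "banana": np.array([0.0, 0.9, 0.1, 0.8]),
    }

    def cosine_similarity(a, b):
        """Close to 1.0 means the vectors point the same way; near 0 means unrelated."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["Paris"], embeddings["London"]))  # high (~0.99)
    print(cosine_similarity(embeddings["Paris"], embeddings["banana"]))  # low (~0.12)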

🎯 2. Self-Attention: Context-Aware Focus

What it is: This lets the model decide which words in the sentence are important — and how much they should influence the current prediction.

Analogy: When reading the sentence:

"The cat that chased the mouse was fast,"

you understand that "was" refers to "cat," not "mouse."

How it works: Every token computes its attention with every other token using:

  • Q (Query)
  • K (Key)
  • V (Value)

The model takes the dot product of each Query with every Key, divides by √(dimension size), and applies a softmax to get attention scores — higher scores mean more relevance. The scores are then used to take a weighted average of the Values.

So, when predicting the word after "was," the model gives higher attention to "cat."
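Here is a minimal NumPy sketch of scaled dot-product attention, the formula softmax(QKᵀ / √d_k) · V, with random matrices standing in for the learned Query/Key/Value projections:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V"""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # how relevant each token is to every other token
        weights = softmax(scores, axis=-1)   # each row sums to 1: an attention distribution
        return weights @ V, weights

    # 3 tokens, 4-dimensional vectors (random stand-ins for learned projections)
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))

    output, weights = scaled_dot_product_attention(Q, K, V)
    print(weights.round(2))   # each row shows how much one token attends to the others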

🧪 3. Feedforward Layers: Refining the Understanding

What it is: After attention decides what's important, feedforward layers refine the info.

Analogy: Imagine reading a sentence, focusing on key words, and then pausing to think deeply about what it means.

Each vector is transformed like this:

  • Output = ReLU(Wx + b)

Here:

  • x is the input vector
  • W and b are weights and biases the model learns
  • ReLU is a non-linear activation that lets the network model non-linear patterns

This makes the model capable of nuanced understanding — e.g. knowing that "bark" means different things in "tree bark" and "dog bark."
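A minimal sketch of one feedforward block with toy dimensions (real models use thousands of dimensions, and the hidden layer is typically about 4x wider than the model dimension):

    import numpy as np

    def feedforward(x, W1, b1, W2, b2):
        """Position-wise feedforward block: expand, apply ReLU, project back down."""
        hidden = np.maximum(0, x @ W1 + b1)   # ReLU(Wx + b)
        return hidden @ W2 + b2

    d_model, d_hidden = 8, 32                 # toy sizes
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
    W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

    x = rng.normal(size=(d_model,))               # one token's vector after attention
    print(feedforward(x, W1, b1, W2, b2).shape)   # (8,): same shape, refined meaning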

🔁 4. Residual Connections + Layer Normalization: Keeping It Stable

What it is: To prevent vanishing gradients and unstable learning in deep networks, each layer adds the original input back in and normalizes it.

Analogy: Like re-reading the last sentence to make sure you didn't lose track.

This helps the model train deeper and faster without losing past understanding.
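A sketch of the pattern, with a simplified layer norm (the learned scale and shift are omitted) and a stand-in for the attention or feedforward sublayer:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        """Rescale a vector to zero mean and unit variance."""
        return (x - x.mean()) / np.sqrt(x.var() + eps)

    def transformer_sublayer(x, sublayer):
        """Residual connection: add the original input back in, then normalize.

        Some architectures normalize before the sublayer instead; the idea is the same.
        """
        return layer_norm(x + sublayer(x))

    x = np.array([1.0, 2.0, 3.0, 4.0])
    out = transformer_sublayer(x, lambda v: 0.1 * v)   # lambda stands in for attention/feedforward
    print(out)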

🚀 End-to-End Flow — Let's Put It All Together!

Let's walk through an example with a real prompt:

📝 Prompt:

"The Eiffel Tower is in"

What Happens Behind the Scenes:

  1. Tokenization
    → Break into tokens: ["The", " Eiffel", " Tower", " is", " in"]
  2. Embedding
    → Each token gets a high-dimensional vector (learned during training)
  3. Positional Encoding
    → Adds info like "this is the 1st, 2nd, 3rd word..."
  4. Transformer Layers (48+ times!)
    Each layer does:
    • Compute self-attention → figure out what to focus on
    • Pass through feedforward → transform meaning
    • Apply residuals + layer norm → keep things stable
  5. Prediction
    → The model predicts the most likely next token: "Paris"
  6. Loop
    → New prompt becomes: "The Eiffel Tower is in Paris" → predict next word

GPT doesn't see your sentence like a human. It sees patterns in numbers — but thanks to this layered structure, it can complete your thoughts with surprising fluency.

Every prediction is a mathematical dance of meaning, memory, and probability — played out over hundreds of Transformer layers.

That's how something as simple as "The Eiffel Tower is in..." becomes "Paris."
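Here is the whole loop boiled down to a few lines of Python, with a tiny lookup table standing in for the billions of learned parameters:

    # Steps 1-6 in miniature: a lookup table plays the role of the Transformer.
    toy_model = {
        ("The", " Eiffel", " Tower", " is", " in"): " Paris",
        ("The", " Eiffel", " Tower", " is", " in", " Paris"): ".",
    }

    def predict_next_token(tokens):
        return toy_model.get(tuple(tokens), "<end>")

    tokens = ["The", " Eiffel", " Tower", " is", " in"]   # step 1: tokenization (given)
    while True:
        next_token = predict_next_token(tokens)           # steps 2-5: embed, attend, predict
        if next_token == "<end>":
            break
        tokens.append(next_token)                         # step 6: loop with the new token

    print("".join(tokens))   # "The Eiffel Tower is in Paris."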


⚙️ PART 4: How It Was Trained

Training GPT involves three phases:

  1. Pretraining:
    • Trained on a huge corpus (websites, books, code) to predict the next token.
    • Objective: Minimize cross-entropy loss (a short sketch follows this list)
  2. Supervised Fine-Tuning (SFT):
    • Human annotators provide example dialogues.
    • Model learns more structured, helpful responses.
  3. Reinforcement Learning from Human Feedback (RLHF):
    • Two models are trained: the base model and a reward model.
    • The base model generates outputs. The reward model scores them.
    • GPT is then fine-tuned using Proximal Policy Optimization (PPO) to prefer higher-rated responses.
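The pretraining objective from step 1 is simple to state in code: the loss is the negative log of the probability the model assigned to the token that actually came next. A sketch:

    import numpy as np

    def cross_entropy(predicted_probs, true_token_index):
        """Negative log probability given to the correct next token."""
        return -np.log(predicted_probs[true_token_index])

    # Suppose the vocabulary is [" Paris", " France", " Europe", " a"] and the
    # training text actually continues with " Paris" (index 0).
    print(cross_entropy(np.array([0.85, 0.10, 0.04, 0.01]), 0))   # ~0.16: good prediction, low loss
    print(cross_entropy(np.array([0.05, 0.90, 0.04, 0.01]), 0))   # ~3.0: bad prediction, high loss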

🧭 PART 5: What About Memory and History?

ChatGPT doesn't "remember" across chats — unless it's explicitly given memory (like ChatGPT's optional Memory feature).

Within a session, everything is stored in the context window.

Example:

You: "What's the capital of France?"
ChatGPT: "Paris"
You: "And its population?" ← This relies on the previous context.

If the context window is exceeded, the model may "forget" earlier parts.
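A sketch of what "stored in the context window" means in practice: the visible history is sent back to the model on every turn. The role/content structure below mirrors common chat APIs and is shown purely for illustration:

    # The model has no hidden memory of its own; it only sees what is resent each turn.
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris"},
        {"role": "user", "content": "And its population?"},  # only makes sense with the lines above
    ]

    # Everything below (up to the context window) is what the model reads before answering.
    for message in conversation:
        print(f'{message["role"]}: {message["content"]}')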


🎨 PART 6: How It Generates Images (via DALL·E)

ChatGPT can interface with image models, like DALL·E 3, to turn text prompts into visuals.

How It Works

  1. Tokenization: The prompt (e.g. "A panda surfing in space") is tokenized → converted to embeddings.
  2. Conditioning: These embeddings guide a diffusion model — trained to convert noise into meaningful images.
  3. Diffusion Process: The model starts with pure Gaussian noise (static). Over 20–50 steps, it gradually denoises that static into a realistic image.

Math Behind It:

During training, noise is gradually added to real images and a network εθ learns to predict that noise. Generation runs the process in reverse: starting from pure noise, each step removes a bit of the predicted noise. A standard (DDPM-style) update is:

  • xₜ₋₁ = (1/√αₜ) · ( xₜ − ((1−αₜ)/√(1−ᾱₜ)) · εθ(xₜ, t) ) + σₜ·z

where xₜ is the noisy image at step t, αₜ comes from the noise schedule, and z is fresh Gaussian noise.

Model Used: U-Net + cross-attention layers conditioned on the prompt.

Example:

Prompt: "A futuristic library floating in clouds" → Image generated by reversing the noise process, with the whole canvas refined a little at each step.

DALL·E doesn't paint like a human — it mathematically estimates what each patch should look like, step by step.
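A hedged sketch of that reverse (denoising) loop in Python, following the standard DDPM update given above, with a zero-output placeholder standing in for the trained U-Net:

    import numpy as np

    def denoise_step(x_t, t, predicted_noise, alphas, alpha_bars, betas, rng):
        """One DDPM-style reverse step: remove a little of the predicted noise."""
        coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
        mean = (x_t - coef * predicted_noise) / np.sqrt(alphas[t])
        if t == 0:
            return mean
        return mean + np.sqrt(betas[t]) * rng.normal(size=x_t.shape)

    T = 50                                       # number of denoising steps
    betas = np.linspace(1e-4, 0.02, T)           # noise schedule
    alphas = 1 - betas
    alpha_bars = np.cumprod(alphas)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 8))                  # start from pure Gaussian noise ("static")

    for t in reversed(range(T)):
        predicted_noise = np.zeros_like(x)       # placeholder for the U-Net's prediction
        x = denoise_step(x, t, predicted_noise, alphas, alpha_bars, betas, rng)

    print(x.shape)   # (8, 8) "image"; with a real U-Net this would now match the prompt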


🌐 PART 7: How It Uses Real-Time Information

ChatGPT's underlying models (GPT-3.5, GPT-4) are trained on data with a cutoff date, so they are not inherently aware of current events.

But with browsing enabled, it can pull in real-time info.

How:

  • Your query → sent to Bing or another search engine
  • The response is skimmed for trusted sources
  • Key sentences are summarized

Example: You ask: "Who won the IPL final yesterday?" → ChatGPT browses → Finds ESPN or Cricbuzz → Extracts result → Summarizes answer

It Doesn't Browse Like You:

  • No clicking
  • No scrolling
  • No loading ads

It reads the raw HTML/text and processes it very quickly.
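A sketch of that pipeline with placeholder functions; search_web and summarize below are illustrative stand-ins, not OpenAI internals:

    def search_web(query):
        """Placeholder: a real system would call a search engine API here."""
        return [{"url": "https://example.com/ipl-final",
                 "text": "Team A beat Team B by 5 wickets in last night's final..."}]

    def summarize(question, passages):
        """Placeholder: a real system would feed the passages back into the language model."""
        return f"Based on {len(passages)} source(s): {passages[0]['text']}"

    question = "Who won the IPL final yesterday?"
    passages = search_web(question)          # query -> search engine -> raw page text
    print(summarize(question, passages))     # raw text -> condensed answer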

Limitations:

  • Might misread poorly formatted content
  • May hallucinate if sources contradict
  • Can't verify deep nuance like a human journalist

⚠️ PART 8: Where It Goes Wrong

ChatGPT is incredibly powerful, but still fallible.

Reasons for Errors:

  1. Hallucinations:
    • It may confidently make up facts.
    • Cause: Over-generalization from training data.
  2. Stale Knowledge:
    • Offline GPTs don't know recent events.
    • Example: "Tell me who won the 2025 Nobel Prize" → No answer unless browsing is on.
  3. Context Limit:
    • Long chats may exceed token limit → forgetting happens.
  4. Biases:
    • If biased content was in training data, model might echo it.

Risk Scenarios:

  • Medical advice: May offer outdated or unsafe info
  • Legal queries: Lacks jurisdiction-specific nuance
  • Code generation: Can return insecure or buggy code

🧠 PART 9: Why It Feels So Smart

Even though it's a token predictor, ChatGPT seems intelligent. Why?

Emergent Behaviour:

  • With billions of parameters and terabytes of data, it captures deep statistical patterns
  • It can compose essays, write poems, answer riddles — all via probability

System Prompt and Guardrails:

  • OpenAI uses a system prompt to shape personality, tone, safety
  • Example: "You are ChatGPT, a helpful assistant."

Examples of Smartness:

  • Can do multi-step math (with help)
  • Can translate Shakespearean English
  • Can critique its own answers

🎓 CONCLUSION: It's Just Math. But Really Good Math.

ChatGPT is a statistical machine trained on massive data, optimized with human feedback, and guided by clever engineering.

It doesn't "think" — but its performance often feels magical.

The secret? Huge data + deep networks + careful tuning.

And now, you understand what's behind the curtain.