Lecture L1

Vision Transformers (ViT)

Architecture, attention mechanism, variants, and the vanishing-gradient story — a complete undergraduate lecture adapted from the CEAMLS slide deck.

Patch embedding | Self-attention | Multi-Head (MSA) | Encoder block | LayerNorm + Residuals | DeiT · Swin · BEiT · CaiT · MaxViT | Vanishing gradient proof | ViT vs CNN
1

Why Transformers for Vision? CNNs vs ViT

Fundamental limitations of convolution and what Transformers offer instead.

CNN limitation

A 3×3 kernel sees only 9 pixels at a time — a local receptive field. To let a top-left pixel influence a bottom-right pixel, you must stack many conv layers, and the global receptive field is still approximate and path-dependent.
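As a rough illustration of the stacking argument above, the sketch below counts how far information can travel through stacked convolutions. It assumes stride-1 layers with an odd kernel and no pooling or dilation (real CNNs downsample, so they need far fewer layers). The function names are illustrative and do not come from the slide deck.

```python
def receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Side length of the input window that one output pixel can see
    after `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def layers_to_span(offset: int, kernel: int = 3) -> int:
    """Fewest stacked stride-1 conv layers needed before an output pixel
    can depend on an input pixel `offset` positions away along one axis."""
    reach_per_layer = (kernel - 1) // 2      # a 3x3 kernel reaches 1 pixel per layer
    return -(-offset // reach_per_layer)     # ceiling division


print(receptive_field(1))    # 3   (one 3x3 layer sees a 3x3 window)
print(receptive_field(5))    # 11
print(layers_to_span(223))   # 223 (corner to corner of a 224-pixel side)
```

Under these assumptions, linking opposite corners of a 224×224 image takes 223 plain 3×3 layers, whereas a single self-attention layer connects every pair of patches directly.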

What ViT brings
  • Global self-attention: every patch attends to every other patch in a single layer.
  • Little baked-in inductive bias: the model learns translation invariance (or not) from data.
  • Unified architecture with NLP Transformers → multi-modal models, transfer learning.
  • Scales with data and parameters: ViT-G/14 (about 1.8B params) reaches just over 90% top-1 on ImageNet.
Feature                 | CNN                       | ViT
Context range           | Local (kernel size)       | Global (all patches)
Inductive bias          | Strong (translation inv.) | Weak (learned from data)
Data needed             | Less (biases help)        | More, or pre-training
Computation             | O(N·k²)                   | O(N²), quadratic in tokens
Long-range dependencies | Many stacked layers       | One single layer ✓
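To make the O(N²) entry concrete, here is a small sketch that counts the entries of the N×N attention matrix for a few image sizes. It assumes square images, 16×16 patches, and ignores the [CLS] token; the helper names are made up for this example.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def attention_entries(image_size: int, patch_size: int = 16) -> int:
    """Entries in the N x N attention matrix of one head ([CLS] ignored)."""
    n = num_patches(image_size, patch_size)
    return n * n


for size in (224, 448, 896):
    print(size, num_patches(size), attention_entries(size))
# 224 196 38416
# 448 784 614656
# 896 3136 9834496
```

Doubling the side length multiplies the token count by 4 and the attention matrix by 16, which is why high-resolution inputs push towards windowed variants such as Swin.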
2

Patch Embedding & Tokenisation

How an image becomes a sequence of tokens — the very first step of ViT.

  1. Original image: 224×224×3 RGB.
  2. Extract patches: divide into non-overlapping P×P squares (P=16) → 14×14 = 196 patches of 16×16×3 = 768 values each. Each patch is a "word".
  3. Linear projection: flatten and multiply by learnable matrix E ∈ ℝ^{(P²·C) × D} → each patch becomes a D=768 embedding. Mathematically equivalent to a Conv 16×16, stride 16, with D output channels.
  4. Prepend [CLS] token: learnable vector that aggregates global information; its output is the classification feature (BERT-style).
  5. Add positional encoding: learnable vector per position. Without it, self-attention is permutation-equivariant (shuffling the patches only shuffles the outputs), so the model cannot tell top-left from bottom-right.
[3, 224, 224] → 196 × [768] → linear → [196, 768] → + CLS → [197, 768] → + PE → [197, 768]
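The pipeline above can be sketched in plain Python on toy sizes. This is a minimal sketch, not the reference implementation: it assumes a 3×32×32 image, P=16 and D=8 to keep it fast, random values in place of learned weights, and a channel-major flattening order (any fixed order works, since E is learned). All names are illustrative.

```python
import random


def extract_patches(image, patch):
    """image: nested list [C][H][W]. Returns flattened patches, row-major over the grid.
    Assumes H and W are multiples of `patch`."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[c][top + i][left + j]
                    for c in range(channels)
                    for i in range(patch)
                    for j in range(patch)]
            patches.append(flat)
    return patches


def linear(vec, weight):
    """vec: [in], weight: [in][out] -> [out] (no bias)."""
    return [sum(v * row[k] for v, row in zip(vec, weight))
            for k in range(len(weight[0]))]


random.seed(0)
C, H, W, P, D = 3, 32, 32, 16, 8            # toy sizes, not ViT-Base
image = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
E = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
cls_token = [0.0] * D                       # learnable in a real model

patches = extract_patches(image, P)                       # N x (P*P*C)
tokens = [cls_token] + [linear(p, E) for p in patches]    # (N+1) x D
pos = [[random.gauss(0, 0.02) for _ in range(D)] for _ in tokens]
z0 = [[t + e for t, e in zip(tok, p)] for tok, p in zip(tokens, pos)]

print(len(patches), len(patches[0]))   # 4 768
print(len(z0), len(z0[0]))             # 5 8
```

With the lecture's sizes (224×224 image, D=768) the same steps give 196 patches and a [197, 768] token matrix.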
3

Self-Attention — Query, Key, Value

The mathematical core of the Transformer — how patches 'talk to' each other.

Each token produces three vectors: Q (what I'm looking for), K (what I offer), V (my content).

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

Q = X · W_Q    (queries)
K = X · W_K    (keys)
V = X · W_V    (values)
d_k = dimension of key vectors  (scaling)
  • Q·Kᵀ is an [N×N] matrix; entry (i,j) = how much token i wants to attend to token j.
  • Softmax turns each row into probabilities summing to 1.
  • ·V takes a weighted sum of values — the output for each token.
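The three steps above fit in a few lines. The sketch below is a toy version in plain Python: it takes Q, K and V as given (skipping the W_Q, W_K, W_V projections) and uses 3 tokens with d_k = d_v = 2. Helper names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V. Returns (output, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                          # N x N
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights                        # N x d_v, N x N


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(q, k, v)

for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

Token 0's query [1, 0] matches keys 0 and 2 equally and key 1 least, so row 0 of the weights puts equal mass on tokens 0 and 2 and less on token 1.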
Why divide by √d_k?
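For large d_k, raw dot products grow large: if the components of q and k are independent with zero mean and unit variance, q·k has variance d_k. Large logits → softmax saturates → gradient ≈ 0 (vanishing gradient on the attention path). Dividing by √d_k brings the logit variance back to about 1 and keeps logits in a healthy range. d_k=64 → divide by 8; d_k=256 → divide by 16.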
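A quick way to see the saturation is to compare the largest attention weight with and without the √d_k scaling. The sketch below assumes standard-normal query and key components and 16 keys; these choices and the function name are illustrative only.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def max_attention_weight(d_k, scale, n_keys=16, seed=0):
    """Largest softmax weight one random query assigns across n_keys random keys."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]
    logits = [sum(a * b for a, b in zip(q, key)) for key in keys]
    if scale:
        logits = [l / math.sqrt(d_k) for l in logits]
    return max(softmax(logits))


for d_k in (64, 256):
    print(d_k,
          "unscaled:", round(max_attention_weight(d_k, scale=False), 4),
          "scaled:", round(max_attention_weight(d_k, scale=True), 4))
```

Without scaling, one key typically takes almost all of the weight (a nearly one-hot softmax, whose gradient is close to zero); with scaling the weights stay spread out.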
4

Multi-Head Self-Attention (MSA)

Running attention in parallel across multiple representation subspaces.

MSA(X) = Concat(head₁, …, head_h) · W_O
head_i = Attention(X·W_Qi, X·W_Ki, X·W_Vi)
Different aspects: Head 1 may focus on texture (nearby patches), Head 2 on semantic similarity, Head 3 on object boundaries.
Richer representations: each head has its own W_Qi, W_Ki, W_Vi, so h heads attend in h different d_k-dimensional subspaces in parallel instead of one.
ViT-Base: h=12 heads, D=768, d_k=d_v=64. 12×64=768 (dims preserved). ~2.36M params per MSA layer (4·D² weights plus biases).
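The ~2.36M figure can be checked by hand. The sketch below assumes the per-head matrices are stacked into four D×D projections (W_Q, W_K, W_V, W_O), each with a bias vector; the function names are illustrative.

```python
def head_dim(d_model: int, heads: int) -> int:
    """Per-head key/value width d_k = D / h."""
    assert d_model % heads == 0, "D must be divisible by the number of heads"
    return d_model // heads


def msa_params(d_model: int, bias: bool = True) -> int:
    """Parameters in one multi-head self-attention layer."""
    weights = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
    biases = 4 * d_model if bias else 0
    return weights + biases


print(head_dim(768, 12))              # 64
print(msa_params(768, bias=False))    # 2359296
print(msa_params(768))                # 2362368
```

Note that the count does not depend on h: splitting D=768 into 12 heads of 64 costs the same as one head of 768.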
5

Complete ViT Architecture

From image → patches → 12 encoder blocks → CLS → classification head.

Image 224×224×3
   ↓ Patch Embed (16×16, stride 16)
   ↓ + [CLS] + Positional Encoding
[197 × 768]
   ↓ × 12 Encoder Blocks
[197 × 768]
   ↓ extract CLS token
[1 × 768]
   ↓ MLP head
[K classes]
ViT-Base config: P=16, N=196, D=768, L=12 blocks, h=12 heads, d_k=64, FFN hidden=3072 (4·D), 86M parameters.
Variant   | Layers | Hidden D | Heads | Params
ViT-Small | 12     | 384      | 6     | 22M
ViT-Base  | 12     | 768      | 12    | 86M
ViT-Large | 24     | 1024     | 16    | 307M
ViT-Huge  | 32     | 1280     | 16    | 632M
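The 86M figure for ViT-Base can be reproduced with a back-of-the-envelope count. The sketch below is a rough tally under stated assumptions: 16×16 patches on a 224×224 RGB image, biases on every linear layer, a learnable position embedding for all 197 tokens, a final LayerNorm, and a 1000-class linear head. The function name is made up for this example.

```python
def vit_params(d_model, layers, patch=16, channels=3, image=224,
               mlp_ratio=4, num_classes=1000):
    """Approximate parameter count of a plain ViT classifier."""
    n_tokens = (image // patch) ** 2 + 1                   # patches + [CLS]
    patch_embed = patch * patch * channels * d_model + d_model
    cls_and_pos = d_model + n_tokens * d_model
    msa = 4 * d_model * d_model + 4 * d_model              # W_Q, W_K, W_V, W_O + biases
    hidden = mlp_ratio * d_model
    ffn = d_model * hidden + hidden + hidden * d_model + d_model
    layer_norms = 2 * 2 * d_model                          # two LNs, each with gamma and beta
    block = msa + ffn + layer_norms
    final_ln = 2 * d_model
    head = d_model * num_classes + num_classes
    return patch_embed + cls_and_pos + layers * block + final_ln + head


print(vit_params(768, 12))                    # 86567656, i.e. about 86M (ViT-Base)
print(round(vit_params(384, 12) / 1e6, 1))    # ViT-Small, in millions
```

Counts for the other rows can differ from the table by a few million depending on what is included, for example the size of the classification head or the patch size used.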
6

Transformer Encoder Block — Inside One Block

Two residual sub-blocks: Pre-LN → MSA → ⊕, Pre-LN → FFN → ⊕.

# Pre-LN Transformer block (used in ViT)
x'  = MSA( LN(x)  ) + x        # residual 1
x'' = FFN( LN(x') ) + x'       # residual 2

# Layer Norm (per-token, across D features):
LN(x) = γ ⊙ (x − μ) / √(σ² + ε)  +  β

# Feed-Forward Network (per-token, independent):
FFN(x) = GELU(x·W₁ + b₁) · W₂ + b₂      # 768 → 3072 → 768
  • Pre-LN (normalise before MSA/FFN) is more stable than Post-LN for deep stacks.
  • MSA mixes tokens; FFN does not — FFN acts independently per token, a learned "memory" lookup.
  • GELU (Gaussian Error Linear Unit) smoothly gates the input; it is the activation used in ViT and BERT in place of ReLU.
  • Two residuals per block × 12 blocks = 24 gradient highways from output back to input.
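The block structure above can be sketched in plain Python. To stay short, this is a skeleton under loud assumptions: LayerNorm and GELU are implemented as in the formulas, but the MSA and FFN sub-layers are passed in as functions, and the two stand-ins used here (`uniform_mix`, `toy_ffn`) are hypothetical placeholders, not real attention or a real two-layer MLP.

```python
import math


def layer_norm(token, gamma=None, beta=None, eps=1e-6):
    """Normalise one token across its D features, then scale and shift."""
    d = len(token)
    mu = sum(token) / d
    var = sum((v - mu) ** 2 for v in token) / d
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(token, gamma, beta)]


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def pre_ln_block(x, msa, ffn):
    """x: list of tokens. msa maps all tokens to all tokens; ffn maps one token to one token."""
    x1 = add(msa([layer_norm(t) for t in x]), x)          # residual 1
    x2 = add([ffn(layer_norm(t)) for t in x1], x1)        # residual 2
    return x2


def uniform_mix(tokens):
    """Stand-in for MSA: uniform attention, every token receives the mean token."""
    n = len(tokens)
    mean = [sum(col) / n for col in zip(*tokens)]
    return [list(mean) for _ in tokens]


def toy_ffn(token):
    """Stand-in for the FFN: GELU applied per feature, no weights."""
    return [gelu(v) for v in token]


x = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 0.5], [2.0, 2.0, -2.0, 0.0]]
y = pre_ln_block(x, uniform_mix, toy_ffn)
print(len(y), len(y[0]))   # 3 4 (shape is preserved)
```

If both sub-layers output zeros, the block returns x unchanged, which is the identity path that the residual connections provide.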
7

Vanishing Gradient — Why Residuals Save ViT

A short argument that the +x in x' = F(x)+x keeps a direct path open for the gradient.

x_{ℓ+1} = F_ℓ(x_ℓ) + x_ℓ           (residual block)

∂L/∂x_0 = ∂L/∂x_L · ∏_{ℓ=0..L-1} (∂F_ℓ/∂x_ℓ + I)

Expanding the product gives I + Σ_ℓ ∂F_ℓ/∂x_ℓ + higher-order terms.
Even if every ∂F_ℓ/∂x_ℓ → 0  (vanishing), the product tends to I,
so ∂L/∂x_0 ≈ ∂L/∂x_L instead of collapsing to 0.

Combined with LayerNorm (keeps activations at μ=0, σ=1) and the √d_k scaling inside attention, ViT trains stably to 12, 24, even 32 layers deep — something CNNs only achieved after the ResNet (2015) skip-connection breakthrough.
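A scalar toy version of the argument makes the difference visible. The sketch below replaces each Jacobian ∂F_ℓ/∂x_ℓ with a single number j (an assumption made purely for illustration) and compares a plain stack, whose gradient factor is ∏ j, with a residual stack, whose factor is ∏ (1 + j).

```python
def plain_gradient(jacobians):
    """Gradient factor through a plain stack: product of the layer derivatives."""
    g = 1.0
    for j in jacobians:
        g *= j
    return g


def residual_gradient(jacobians):
    """Gradient factor through a residual stack: product of (1 + layer derivative)."""
    g = 1.0
    for j in jacobians:
        g *= 1.0 + j
    return g


small = [0.1] * 12                          # 12 layers with small derivatives
print(plain_gradient(small))                # about 1e-12: vanished
print(residual_gradient(small))             # about 3.14: still usable
print(residual_gradient([0.0] * 12))        # 1.0: pure identity path
```

The identity path helps but is not an unconditional guarantee: in this toy model a layer with j = -1 would still zero the product, so the normalisation and scaling mentioned above matter as well.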

8

ViT Variants

Five families you should know — when to reach for each.

Variant | Year | Key idea | Best for
DeiT    | 2021 | Data-efficient ViT; distillation token from a CNN teacher. | Small/medium datasets; no JFT-300M pre-training needed.
Swin    | 2021 | Shifted-window attention; hierarchical, linear complexity. | Dense prediction (detection, segmentation), high-res imagery.
BEiT    | 2021 | BERT-style masked image modelling self-supervised pre-training. | Label-scarce domains (medical, satellite).
CaiT    | 2021 | Class-Attention layers; LayerScale enables very deep ViTs. | Maximum ImageNet accuracy at fixed compute.
MaxViT  | 2022 | Block-attention + grid-attention; hybrid with conv stem. | Long-range + local features in one network; strong all-rounder.
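To see what "linear complexity" means for Swin, the sketch below counts query-key pairs for global attention versus attention restricted to fixed windows. It assumes 4×4 patches on a 224×224 image and 7×7-token windows (settings commonly quoted for Swin; treat them as assumptions here) and ignores the shifted-window step. The function names are illustrative.

```python
def global_attention_pairs(n_tokens: int) -> int:
    """Every token attends to every token: N^2 pairs."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens: int, window: int = 7) -> int:
    """Each token attends only within its window of window*window tokens: N * M^2 pairs."""
    return n_tokens * window * window


n = (224 // 4) ** 2                     # 56 x 56 = 3136 tokens
print(global_attention_pairs(n))        # 9834496
print(windowed_attention_pairs(n))      # 153664
```

Because the window size is fixed, the windowed cost grows in proportion to N, while the global cost grows with N²; at this resolution the gap is already a factor of 64.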
9

ViT vs CNN — When to Use Which

Situation | Prefer
< 50k labelled images, no pre-training | CNN (ResNet/EfficientNet)
Large pre-training corpus available (ImageNet-21k, JFT) | ViT
High-resolution dense prediction (segmentation) | Swin / MaxViT
Multi-modal (image + text) | ViT (shares Transformer with NLP)
Edge device, tight FLOPs budget | CNN or MobileViT
Satellite / medical, self-supervised pre-training | BEiT / MAE-pretrained ViT
10

Equations cheat sheet

Patch embedding: z_0 = [x_cls; x_p^1 E; x_p^2 E; … ; x_p^N E] + E_pos
Attention head: A_i = softmax(Q_i K_iᵀ / √d_k) V_i
MSA: MSA(x) = Concat(A_1,…,A_h) W_O
Encoder block: x' = MSA(LN(x)) + x ; z = FFN(LN(x')) + x'
FFN: FFN(x) = GELU(x W_1 + b_1) W_2 + b_2
Classification: ŷ = softmax( LN(z_L^{cls}) · W_head )
Key takeaway
ViT replaces convolution's local prior with global self-attention. With residual connections and LayerNorm, it scales gracefully to billions of parameters — and is the backbone of modern damage-assessment encoders in our Siamese pipeline.