Lecture L3

Siamese Neural Networks

Architecture, variants and theory — from Bell Labs signature verification (1993) to modern xBD pre/post damage assessment.

Shared encoder · Difference block · Skip connections · Attention gates · Deep supervision · Contrastive / Triplet · Vanishing gradient proof
1

What is a Siamese Network?

Origins, intuition, and the core idea.

A Siamese network is a neural architecture containing two (or more) identical sub-networks that share the same weights. Each branch processes a different input; their outputs are compared to produce a similarity score or difference map.

  • Bromley et al. (1993): Bell Labs, signature verification. The original "Siamese" twin network.
  • Koch et al. (2015): one-shot image classification with a Siamese CNN trained as a same/different verifier (cross-entropy on a learned, weighted L1 distance between the twin embeddings).
  • Modern uses: face verification (FaceID-style), satellite change detection (xBD), medical image comparison, self-supervised learning (SimSiam, BYOL).
Why "Siamese"?
Two branches share weights — like Siamese twins sharing a body. Updating one updates the other. They are literally the same network, run twice.
Four key properties
① Shared weights · ② Symmetric · ③ Comparison output · ④ Half the parameters of two separate nets.
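
The four properties can be checked on a toy model. The following is a minimal PyTorch sketch, not taken from the lecture: `TinyEncoder`, `siamese_distance`, and the layer sizes are hypothetical names and values chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a real backbone: one linear layer + ReLU."""

    def __init__(self, in_dim=8, emb_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)


def siamese_distance(encoder, a, b):
    """Run the SAME encoder on both inputs; return the squared L2 distance."""
    return ((encoder(a) - encoder(b)) ** 2).sum(dim=1)


encoder = TinyEncoder()
a, b = torch.randn(3, 8), torch.randn(3, 8)

# (2) Symmetric: swapping the inputs does not change the comparison.
d_ab = siamese_distance(encoder, a, b)
d_ba = siamese_distance(encoder, b, a)
print(torch.allclose(d_ab, d_ba))

# (4) Half the parameters of two separate networks.
shared_params = sum(p.numel() for p in encoder.parameters())
separate_params = 2 * shared_params  # two independent copies of TinyEncoder
print(shared_params, separate_params)  # 36 72

# (1) Shared weights: both branches accumulate into the same .grad tensor.
d_ab.sum().backward()
weight = encoder.net[0].weight
print(weight.grad.shape)
```

Because there is only one `encoder` object, a single optimizer step changes what both branches compute, which is the sense in which "updating one updates the other".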
2

Core Architecture for xBD Damage Assessment

Pre-event (t₁)  ──▶ [Shared Encoder θ] ──▶ F_pre  ┐
                                                  ├─▶ Difference Block |F_pre − F_post| ─▶ Decoder ─▶ Damage map (5 classes)
Post-event (t₂) ──▶ [Shared Encoder θ] ──▶ F_post ┘                                                    │
                                                                                                      ▼
                                                                                          Combined Loss = Focal + Dice
                                                                                                      │
                                                              ◀──── Backprop (updates ONE shared encoder θ) ────
  • Pre/post inputs: 512×512×3 satellite images of the same geographic location at two times.
  • Shared encoder: ResNet-50/101 or ViT, with tied parameters θ₁ = θ₂ = θ. Output: feature maps [C × H/8 × W/8] (output stride 8 here; the four-block encoder in section 3 downsamples by 16).
  • Difference block: |F_pre − F_post| (or a learned fusion). Highlights what changed between the two dates.
  • Decoder: upsamples H/8 → H/4 → H/2 → H with skip connections from encoder stages.
  • Output: [5 × H × W] logit tensor; argmax over the class axis gives the per-pixel damage class.
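
The tensor shapes in this pipeline can be traced with a short PyTorch sketch. Everything here is an illustrative assumption: the feature maps are random stand-ins for encoder outputs, the sizes are much smaller than 512×512, and the "decoder" is reduced to a 1×1 convolution plus one bilinear upsampling step instead of a staged U-Net decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W = 2, 16, 64, 64  # small hypothetical sizes for a quick run
num_classes = 5

# Stand-ins for the shared encoder's outputs at 1/8 resolution.
f_pre = torch.randn(B, C, H // 8, W // 8)
f_post = torch.randn(B, C, H // 8, W // 8)

# Difference block: element-wise absolute difference of the two feature maps.
diff = torch.abs(f_pre - f_post)

# Placeholder decoder: 1x1 conv to class logits, then upsample back to H x W.
head = nn.Conv2d(C, num_classes, kernel_size=1)
logits = F.interpolate(head(diff), size=(H, W), mode="bilinear", align_corners=False)

# Per-pixel class index: argmax over the class (channel) axis.
damage_map = logits.argmax(dim=1)
print(diff.shape, logits.shape, damage_map.shape)
```

Note that identical pre and post features give a difference of exactly zero, so any non-zero response in `diff` comes from a change between the two dates (or from misregistration and illumination differences, which the decoder has to learn to ignore).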
3

Inside the Shared Encoder

Conv → BN → ReLU → pool, four times.

Input 512×512×3
  → Conv Block 1  → 256×256×64   (stride 2, 64 filters)
  → Conv Block 2  → 128×128×128
  → Conv Block 3  →  64×64×256
  → Conv Block 4  →  32×32×512   ← Feature map for comparison
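
The shape progression above is simple arithmetic: each block halves the spatial size and sets a new channel count. A small sketch in plain Python (the function name `trace_encoder_shapes` is hypothetical) makes the total downsampling factor explicit:

```python
def trace_encoder_shapes(size=512, in_channels=3, widths=(64, 128, 256, 512)):
    """Return (height, width, channels) for the input and after each stride-2 block."""
    shapes = [(size, size, in_channels)]
    for channels in widths:
        size //= 2  # each block halves the spatial resolution
        shapes.append((size, size, channels))
    return shapes


shapes = trace_encoder_shapes()
for h, w, c in shapes:
    print(f"{h}x{w}x{c}")

total_stride = shapes[0][0] // shapes[-1][0]
print("total downsampling factor:", total_stride)  # 512 / 32 = 16
```

Four halvings give a factor of 16, so this four-block encoder produces H/16 features, which is worth keeping in mind when comparing against the H/8 figure quoted for the xBD architecture.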
Conv 3×3
Extracts local edges, textures, shapes.
BatchNorm
x̂ = (x−μ)/√(σ²+ε); y = γx̂+β, with μ and σ² the per-channel batch mean and variance. Keeps activations near zero mean and unit variance → stable gradients.
ReLU
f(x)=max(0,x). Non-linearity that does not saturate for positive inputs (unlike sigmoid).
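
As a worked example of the last two components, here is a plain-Python sketch on a batch of four scalars. It uses the standard BatchNorm form, which divides by √(σ² + ε); the function names and the sample values are illustrative assumptions.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars with the batch mean and (biased) variance."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]


def relu(xs):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in xs]


activations = [2.0, 4.0, 6.0, 8.0]      # mean 5, variance 5
normalized = batch_norm(activations)    # zero mean, approximately unit variance
rectified = relu(normalized)            # negative values are clipped to 0
print(normalized)
print(rectified)
```

With γ = 1 and β = 0 the two below-mean inputs become negative and are zeroed by the ReLU, while the two above-mean inputs pass through unchanged.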
4

Decoder — Skip Connections & Attention Gates

U-Net-style skip connections route encoder features directly to the matching decoder stage, restoring spatial detail that pooling discarded.

Decoder stage k:
  u_k     = Upsample( d_{k-1} )                  # coarse decoder features
  skip_k  = Encoder feature at same resolution
  d_k     = Conv( Concat[ u_k, AttentionGate(skip_k, u_k) ] )

# Attention gate (Oktay et al., 2018), with x = skip_k (encoder feature) and g = u_k (gating signal)
α = σ( ψ( ReLU( W_x · x + W_g · g ) ) )           # gate ∈ [0,1] per pixel
gated_skip = α ⊙ x

The attention gate lets the decoder ask "which encoder pixels are relevant here?" and suppresses the rest — sharper boundaries, fewer false positives on background.
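
The gate formula above maps onto three 1×1 convolutions. The module below is a simplified PyTorch sketch rather than the reference implementation: it assumes the skip feature `x` and the gating signal `g` already have the same spatial size, and the channel counts are arbitrary illustrative values.

```python
import torch
import torch.nn as nn


class AttentionGate(nn.Module):
    """Additive attention gate: alpha = sigmoid(psi(relu(W_x x + W_g g)))."""

    def __init__(self, x_channels, g_channels, inter_channels):
        super().__init__()
        self.W_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.W_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, x, g):
        # x: skip features from the encoder; g: gating signal from the decoder.
        # Simplifying assumption: x and g share the same height and width.
        alpha = torch.sigmoid(self.psi(torch.relu(self.W_x(x) + self.W_g(g))))
        # alpha has one channel and broadcasts over the channels of x.
        return alpha * x, alpha


torch.manual_seed(0)
skip = torch.randn(1, 32, 16, 16)  # encoder feature (x)
up = torch.randn(1, 64, 16, 16)    # upsampled decoder feature (g)
gate = AttentionGate(x_channels=32, g_channels=64, inter_channels=16)
gated_skip, alpha = gate(skip, up)
print(gated_skip.shape, alpha.shape)
```

Since every gate value lies between 0 and 1, the gated skip can only attenuate encoder features, never amplify them; the decoder then concatenates `gated_skip` with the upsampled features as in the stage equations above.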

5

Loss Functions for Siamese Training

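Loss           Formula                            When to use
Contrastive    L = y·D² + (1−y)·max(0, m−D)²      Pairwise similarity (face verification, signature)
Triplet        L = max(0, D(a,p) − D(a,n) + m)    Anchor/positive/negative triplets (face recognition)
Focal          −α(1−pₜ)^γ log(pₜ)                 Imbalanced segmentation (damage)
Dice           1 − 2|A∩B| / (|A|+|B|)             Region overlap, boundary quality
Focal + Dice   L_F + L_D                          DAHiTrA / xBD damage assessment ✓

Symbols: D is the embedding distance; y = 1 for a similar pair and 0 for a dissimilar one; m is the margin; pₜ is the predicted probability of the true class; A and B are the predicted and ground-truth regions.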
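
To make the formulas concrete, here is a plain-Python sketch that evaluates each loss on a single pair, triplet, pixel, or tiny mask. The default values (margin 1.0 and 0.2, α = 0.25, γ = 2) are common choices assumed here for illustration, and the function names are hypothetical; this is not the training code used for xBD.

```python
import math


def contrastive_loss(d, y, margin=1.0):
    """y = 1 for a similar pair, y = 0 for a dissimilar pair; d is the distance."""
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2


def triplet_loss(d_ap, d_an, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, d_ap - d_an + margin)


def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """p_t is the predicted probability of the true class for one pixel."""
    return -alpha * (1 - p_t) ** gamma * math.log(p_t)


def dice_loss(pred, target):
    """pred and target are flat binary masks (lists of 0/1) of equal length.

    Practical implementations usually add a smoothing term so that two empty
    masks do not divide by zero; it is omitted here to match the formula.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1 - 2 * intersection / (sum(pred) + sum(target))


print(contrastive_loss(d=0.5, y=1))       # similar pair: d^2 = 0.25
print(contrastive_loss(d=0.5, y=0))       # dissimilar, inside margin: (1 - 0.5)^2 = 0.25
print(contrastive_loss(d=1.5, y=0))       # dissimilar, beyond margin: 0.0
print(triplet_loss(d_ap=0.3, d_an=1.0))   # margin already satisfied: 0.0
print(focal_loss(0.9) < focal_loss(0.6))  # confident pixels contribute less: True
print(dice_loss([1, 1, 0, 0], [1, 0, 0, 0]))  # 1 - 2*1/(2+1)
```

The focal comparison shows why it suits imbalanced damage maps: well-classified (mostly background) pixels are down-weighted by the (1−pₜ)^γ factor, so the rare damaged pixels dominate the gradient.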
6

Siamese Variants

Variant                                   Key idea                                    When to use
Classic Siamese                           Two identical branches, shared θ            Same modality, same domain (xBD)
Pseudo-Siamese                            Same architecture, separate θ               Slight domain shift (e.g. different sensors)
Asymmetric                                Different architectures per branch          Cross-modality (RGB vs SAR, optical vs infrared)
Triplet                                   Three branches (anchor/pos/neg), shared θ   Metric learning, face recognition
Self-supervised (SimSiam / BYOL / DINO)   Two augmented views of one image            Label-scarce pre-training
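
The difference between the first three variants comes down to how the branch modules are constructed. A minimal PyTorch sketch, using single `nn.Linear` layers as hypothetical stand-ins for full encoder branches:

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(8, 4)  # hypothetical stand-in for one encoder branch

# Classic Siamese: both branches are the same module object (shared theta).
classic_pre, classic_post = base, base

# Pseudo-Siamese: same architecture, but the second branch owns its own weights.
pseudo_pre, pseudo_post = base, copy.deepcopy(base)

# Asymmetric: branches differ in architecture (here, input width) but map
# into a common embedding size so their outputs can still be compared.
asym_optical, asym_sar = nn.Linear(8, 4), nn.Linear(2, 4)


def n_params(*modules):
    """Count unique parameters across modules (shared tensors are counted once)."""
    unique = {id(p): p.numel() for m in modules for p in m.parameters()}
    return sum(unique.values())


print(n_params(classic_pre, classic_post))  # 36: one set of weights
print(n_params(pseudo_pre, pseudo_post))    # 72: two sets of weights
```

The pseudo-Siamese copy starts from identical values but diverges during training, which is what lets each branch adapt to its own sensor.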
7

Vanishing Gradient — Cause & Cure

Why deep Siamese stacks need help, and how skip connections + deep supervision fix it.

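For a plain deep net with x_{ℓ+1} = f_ℓ(x_ℓ):
∂L/∂x_0 = ∂L/∂x_L · ∏_{ℓ=0}^{L−1} ∂f_ℓ/∂x_ℓ

If every ‖∂f_ℓ/∂x_ℓ‖ ≤ c < 1, the norm of the product is bounded by c raised to the number of layers, so it shrinks geometrically → gradient → 0.

Residual block:  x_{ℓ+1} = f_ℓ(x_ℓ) + x_ℓ
∂x_{ℓ+1}/∂x_ℓ = ∂f_ℓ/∂x_ℓ + I
→ the product ∏ (I + ∂f_ℓ/∂x_ℓ) expands to I plus terms involving the ∂f_ℓ/∂x_ℓ, so even if every ∂f_ℓ/∂x_ℓ ≈ 0 the identity path remains → gradient survives.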
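
A scalar toy model shows the size of the effect. In this plain-Python sketch every layer is assumed to have the same small local derivative (0.01, an arbitrary illustrative value), and the function names are hypothetical.

```python
def plain_gradient_scale(local_grad, depth):
    """Scalar plain net: the backward signal is a product of local derivatives."""
    scale = 1.0
    for _ in range(depth):
        scale *= local_grad
    return scale


def residual_gradient_scale(local_grad, depth):
    """Scalar residual net: each layer contributes (1 + local derivative)."""
    scale = 1.0
    for _ in range(depth):
        scale *= 1.0 + local_grad
    return scale


for depth in (10, 50):
    plain = plain_gradient_scale(0.01, depth)
    residual = residual_gradient_scale(0.01, depth)
    print(depth, plain, residual)
```

At depth 50 the plain product is on the order of 1e-100 while the residual product stays between 1 and 2. The same arithmetic also shows the flip side: with larger local derivatives the residual product can grow, which is one reason residual blocks are normally paired with normalization.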

Deep supervision goes further by attaching auxiliary Focal+Dice losses at intermediate decoder stages, so encoder layers also receive gradient through shorter paths, not only through the full decoder chain.
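
A sketch of how auxiliary losses can be wired up in PyTorch follows. It rests on several assumptions: the decoder features are random tensors rather than real decoder outputs, the stage weights (0.25, 0.5, 1.0) are arbitrary, and cross-entropy stands in for the Focal + Dice combination to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, H, W = 5, 32, 32

# Hypothetical decoder features at 1/4, 1/2 and full resolution (batch of 2).
decoder_feats = [
    torch.randn(2, 64, H // 4, W // 4, requires_grad=True),
    torch.randn(2, 32, H // 2, W // 2, requires_grad=True),
    torch.randn(2, 16, H, W, requires_grad=True),
]
# One 1x1 classification head per stage; the last one is the main output.
heads = nn.ModuleList(
    [nn.Conv2d(f.shape[1], num_classes, kernel_size=1) for f in decoder_feats]
)
stage_weights = [0.25, 0.5, 1.0]  # assumed weights: auxiliary stages count less

target = torch.randint(0, num_classes, (2, H, W))

total_loss = 0.0
for feat, head, w in zip(decoder_feats, heads, stage_weights):
    # Bring each stage's logits to full resolution before comparing to the mask.
    logits = F.interpolate(head(feat), size=(H, W), mode="bilinear", align_corners=False)
    # Stand-in loss; the lecture attaches Focal + Dice at each stage instead.
    total_loss = total_loss + w * F.cross_entropy(logits, target)

total_loss.backward()
# Every stage, including the coarsest, receives gradient from its own loss term.
print([f.grad is not None for f in decoder_feats])
```

At inference time only the final head is used; the auxiliary heads exist to shape the gradients during training.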

8

PyTorch — Minimal Siamese Skeleton

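```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholders assumed to be defined elsewhere: build_resnet50_encoder, UNetDecoder,
# focal_fn and dice_fn (Focal + Dice from Lecture L2), and loader, which is
# expected to yield (pre, post, mask) batches.


class SharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # e.g. ResNet-50 backbone, output [B, 512, H/8, W/8]
        self.backbone = build_resnet50_encoder()

    def forward(self, x):
        return self.backbone(x)


class SiameseDamageNet(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.encoder = SharedEncoder()  # ONE encoder, used twice
        self.decoder = UNetDecoder(in_channels=512, out_channels=num_classes)

    def forward(self, pre, post):
        f_pre = self.encoder(pre)         # shared weights θ
        f_post = self.encoder(post)       # same θ: literally the same nn.Module
        diff = torch.abs(f_pre - f_post)  # change signal
        return self.decoder(diff)         # [B, 5, H, W] damage logits


# Training (with Focal + Dice from Lecture L2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SiameseDamageNet().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

model.train()
for pre, post, mask in loader:
    pre, post = pre.to(device), post.to(device)
    mask = mask.to(device).long()  # [B, H, W] integer class labels
    logits = model(pre, post)
    # One-hot target for Dice, [B, H, W] -> [B, 5, H, W]; this assumes dice_fn
    # expects channel-first one-hot targets on the same device as the logits.
    mask_1h = F.one_hot(mask, num_classes=5).permute(0, 3, 1, 2).float()
    loss = focal_fn(logits, mask) + dice_fn(logits, mask_1h)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```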
9

Which variant for which task?

Task                              Recommended variant               Loss
xBD building damage (pre/post)    Classic Siamese + UNet decoder    Focal + Dice + deep supervision
Signature / face verification     Classic Siamese, embedding head   Contrastive or Triplet
Optical vs SAR change detection   Asymmetric Siamese                Focal + Dice
Self-supervised pre-training      SimSiam / BYOL / DINO             Negative cosine similarity (SimSiam, BYOL) or cross-entropy between views (DINO); no negatives
One-shot classification           Classic Siamese                   Contrastive, or cross-entropy on the pair distance (Koch et al., 2015)
Key takeaway
Siamese = one encoder run twice + a comparison. Shared weights guarantee that pre and post are described in the same feature space, so subtraction is meaningful. Combined with Focal+Dice and deep supervision, it is a widely used architecture for pre/post disaster damage assessment.