Overview of Siamese Networks
A Siamese Network is a model architecture consisting of two (or more) identical neural networks arranged in parallel with shared weights. It is designed to learn and evaluate the similarity between inputs and was originally developed for tasks such as signature verification and face recognition (Bromley et al., 1993).
In a Siamese Network, each sub-network receives one of the inputs (e.g., image A and image B) and independently extracts a feature vector (embedding) from it. Since the weights are shared, both networks apply the same transformation to their respective inputs, ensuring fairness in the comparison.
The resulting feature vectors are then compared using a distance metric such as L1 distance (Manhattan distance), L2 distance (Euclidean distance), or cosine similarity, to compute the similarity between the two inputs.
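For illustration, here is a minimal sketch of these three comparison metrics in PyTorch (the embeddings are random placeholders standing in for a Siamese network's outputs):

import torch
import torch.nn.functional as F

emb_a = torch.randn(64)  # embedding of input A
emb_b = torch.randn(64)  # embedding of input B

l1 = torch.sum(torch.abs(emb_a - emb_b))        # L1 (Manhattan) distance
l2 = torch.dist(emb_a, emb_b, p=2)              # L2 (Euclidean) distance
cos = F.cosine_similarity(emb_a, emb_b, dim=0)  # cosine similarity in [-1, 1]
print(f"L1={l1:.3f}  L2={l2:.3f}  cos={cos:.3f}")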
To train the network based on this similarity, loss functions such as Contrastive Loss and Triplet Loss are commonly employed. Contrastive Loss encourages similar pairs to be closer together in the embedding space, while ensuring that dissimilar pairs are separated by a margin. Triplet Loss, on the other hand, uses three samples—Anchor, Positive, and Negative—and trains the network to maximize the distance between the Anchor and the Negative while minimizing the distance between the Anchor and the Positive.
In this way, Siamese Networks are specialized for learning a comparable feature space, and have been widely applied in tasks such as face recognition, signature verification, and similar image retrieval.
Related Algorithms
Siamese Networks are closely associated with similarity learning and metric learning. These models are based on the framework of measuring similarity between inputs in an embedding space, and numerous derived algorithms and applications have been proposed.
Classification by Loss Function
In similarity learning, various loss functions have been introduced to help the network learn what constitutes similarity or dissimilarity between data samples, typically based on distances or similarities.
Contrastive Loss
Trains the network so that the distance between embeddings is small for similar input pairs and large for dissimilar pairs. This encourages data from the same class to be close together, and data from different classes to be well separated in the feature space.
Formula:
L = (1 – y) * d² + y * max(0, margin – d)²
where d is the distance between the two samples, and y is the label (y = 0 for similar, y = 1 for dissimilar).
Triplet Loss
Uses three inputs: an Anchor, a Positive (same class), and a Negative (different class). It trains the network so that the distance between the Anchor and Positive is smaller than that between the Anchor and Negative.
Formula:
L = max(d(a, p) – d(a, n) + margin, 0)
where a is the anchor, p the positive, and n the negative.
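As a quick sketch, PyTorch ships this loss ready-made as nn.TripletMarginLoss; the embeddings below are random placeholders:

import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)  # p=2 -> Euclidean distance
anchor, positive, negative = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
loss = triplet_loss(anchor, positive, negative)  # max(d(a,p) - d(a,n) + margin, 0), averaged over the batch
print(loss.item())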
Quadruplet Loss
Extends the Triplet Loss by adding another Negative sample, introducing additional constraints not only between the Anchor and other samples but also between multiple Negative samples and Positive samples. This further sharpens the decision boundary.
Formula:
L = Triplet_Loss + α * (d(n1, n2) – d(p1, p2))
where α is a hyperparameter that adjusts the strength of the additional constraint, (n1, n2) is the pair of negatives, and (p1, p2) is the pair of positives.
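Published quadruplet losses are usually written slightly differently from the simplified formula above; as a hedged sketch, the widely cited person re-identification formulation (Chen et al., 2017) uses two margin terms:

import torch
import torch.nn.functional as F

def quadruplet_loss(a, p, n1, n2, margin1=1.0, margin2=0.5):
    # a, p, n1, n2: batches of anchor, positive, and two negative embeddings
    d_ap = F.pairwise_distance(a, p)
    d_an1 = F.pairwise_distance(a, n1)
    d_n1n2 = F.pairwise_distance(n1, n2)
    # Triplet-style term, plus an extra constraint pushing the negative pair
    # further apart than the anchor-positive pair
    return (F.relu(d_ap.pow(2) - d_an1.pow(2) + margin1) +
            F.relu(d_ap.pow(2) - d_n1n2.pow(2) + margin2)).mean()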
N-pair Loss
Handles one Anchor with multiple Negatives at once, enabling the network to learn discriminative features more efficiently across the mini-batch compared to contrastive or triplet-based methods.
Formula:
L = log(1 + Σᵢ exp(fᵀfᵢ⁻ – fᵀf⁺))
where f is the anchor embedding, f⁺ is the embedding of the positive sample, and fᵢ⁻ are those of the negatives.
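A minimal sketch of this loss for a single anchor (names are illustrative; in practice the loss is computed over all anchors in the mini-batch):

import torch

def n_pair_loss(anchor, positive, negatives):
    # anchor, positive: [D]; negatives: [N, D]
    pos_sim = anchor @ positive    # f^T f+
    neg_sims = negatives @ anchor  # f^T f_i- for each negative
    # log(1 + sum_i exp(f^T f_i- - f^T f+))
    return torch.log1p(torch.exp(neg_sims - pos_sim).sum())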
Applications in Few-Shot Learning
Matching Networks
Leverages LSTM and attention mechanisms to classify a query sample based on its similarity to a support set (few labeled samples). It computes cosine similarity between the query and support set samples and predicts classes based on similarity weights. Like Siamese Networks, it relies on comparison-based classification.
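A sketch of the core comparison step, with attention weights derived from cosine similarity over the support set (the embeddings are placeholders; the full model also refines them with LSTM-based context encoders):

import torch
import torch.nn.functional as F

def matching_predict(query_emb, support_embs, support_labels, n_classes):
    # Cosine similarity between the query and every support sample
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_embs)  # [S]
    attn = F.softmax(sims, dim=0)                                     # attention weights
    one_hot = F.one_hot(support_labels, n_classes).float()            # [S, C]
    return attn @ one_hot                                             # class probabilities [C]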
Prototypical Networks
Computes a prototype for each class as the mean of the embeddings in the support set, then classifies queries based on their distance (typically Euclidean) to these prototypes. Unlike Siamese Networks which compare sample pairs, Prototypical Networks compare queries to class representatives—making it a simple yet powerful approach for few-shot classification.
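A minimal sketch of prototype computation and nearest-prototype classification (all tensors are placeholders):

import torch

def proto_classify(query_emb, support_embs, support_labels, n_classes):
    # Prototype = mean embedding of each class's support samples
    protos = torch.stack([support_embs[support_labels == c].mean(dim=0)
                          for c in range(n_classes)])       # [C, D]
    dists = torch.cdist(query_emb.unsqueeze(0), protos)[0]  # Euclidean distance to each prototype
    return dists.argmin().item()                            # predicted class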
Relation Networks
Instead of using a fixed distance metric, this method learns a relation module (a neural network) that outputs the similarity score between a query and support sample. This makes it more flexible than traditional Siamese approaches and suitable for capturing complex relationships such as visual or spatial similarity.
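As a sketch, the learned relation module can be a small MLP that scores a concatenated pair of embeddings (the layer sizes here are illustrative assumptions):

import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim * 2, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid()  # relation score in (0, 1)
        )

    def forward(self, query_emb, support_emb):
        pair = torch.cat([query_emb, support_emb], dim=-1)  # compare by concatenation
        return self.net(pair)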
Meta-Learning Approaches (e.g., MAML)
Siamese-based models can also be combined with meta-learning techniques such as Model-Agnostic Meta-Learning (MAML). While MAML focuses on rapid adaptation to new tasks with few steps, incorporating Siamese structures enhances the underlying representation learning, enabling faster and more generalizable adaptation.
Extension to NLP: Siamese BERT (SBERT)
In natural language processing, Siamese architectures have been extended into models like Siamese BERT (SBERT). SBERT uses two parallel BERT encoders to independently embed a pair of sentences. Their cosine similarity is then used to assess semantic closeness. SBERT has shown strong performance in semantic textual similarity (STS), retrieval, and ranking tasks.
Examples of Application Implementations
Siamese Networks are used in a variety of fields, including vision, natural language processing, biometrics, medicine, and search. Examples of implementations are described below.
Example of an image recognition application: comparing handwritten digits with a Siamese Network (MNIST)
Technologies used:
- TensorFlow/Keras or PyTorch
- MNIST dataset
- Contrastive Loss or Triplet Loss
Implementation Overview:
- Create MNIST image pairs (positive pairs: same digit; negative pairs: different digits)
- Build a CNN-based Siamese Network
- Measure the L2 (Euclidean) distance between the two feature vectors
- Train with Contrastive Loss
- After training, measure similarity between new images
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader
import random
import numpy as np
from PIL import Image
# 1. Data preprocessing and pair generation
class SiameseMNIST(Dataset):
def __init__(self, mnist_dataset):
self.mnist_dataset = mnist_dataset
self.targets = mnist_dataset.targets
self.data = mnist_dataset.data
def __getitem__(self, index):
img1, label1 = self.data[index], int(self.targets[index])
# Randomly select pairs with the same label OR pairs with different labels
should_get_same_class = random.randint(0, 1)
while True:
index2 = random.randint(0, len(self.data) - 1)
label2 = int(self.targets[index2])
if should_get_same_class == (label1 == label2):
break
img2 = self.data[index2]
        transform = transforms.Compose([transforms.ToPILImage(), transforms.ToTensor()])
        # Label convention matches the Contrastive Loss below: 0 = similar pair, 1 = dissimilar pair
        return (transform(img1), transform(img2), torch.tensor([int(label1 != label2)], dtype=torch.float32))
def __len__(self):
return len(self.mnist_dataset)
# 2. Siamese network definition
class SiameseNetwork(nn.Module):
def __init__(self):
super(SiameseNetwork, self).__init__()
self.cnn = nn.Sequential(
nn.Conv2d(1, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2)
)
self.fc = nn.Sequential(
nn.Linear(32 * 5 * 5, 128),
nn.ReLU(),
nn.Linear(128, 64)
)
def forward_once(self, x):
x = self.cnn(x)
x = x.view(x.size()[0], -1)
return self.fc(x)
def forward(self, input1, input2):
output1 = self.forward_once(input1)
output2 = self.forward_once(input2)
return output1, output2
# 3. Contrastive Loss definition
class ContrastiveLoss(nn.Module):
def __init__(self, margin=1.0):
super(ContrastiveLoss, self).__init__()
self.margin = margin
    def forward(self, output1, output2, label):
        # keepdim=True keeps the distance shaped [B, 1] so it broadcasts correctly with label
        euclidean_distance = F.pairwise_distance(output1, output2, keepdim=True)
        loss = torch.mean((1 - label) * torch.pow(euclidean_distance, 2) +
                          (label) * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2))
        return loss
# 4. Training
def train(model, dataloader, loss_fn, optimizer, epochs=5):
    device = next(model.parameters()).device  # use whichever device the model is on
    for epoch in range(epochs):
        total_loss = 0
        for img1, img2, label in dataloader:
            img1, img2, label = img1.to(device), img2.to(device), label.to(device)
output1, output2 = model(img1, img2)
loss = loss_fn(output1, output2, label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")
# 5. Execution
if __name__ == "__main__":
transform = transforms.ToTensor()
mnist_train = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
siamese_dataset = SiameseMNIST(mnist_train)
dataloader = DataLoader(siamese_dataset, shuffle=True, batch_size=64)
    model = SiameseNetwork().to("cuda" if torch.cuda.is_available() else "cpu")
loss_fn = ContrastiveLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train(model, dataloader, loss_fn, optimizer, epochs=5)
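The script above trains but does not demonstrate the final step of the overview, scoring new image pairs. A minimal inference sketch reusing the objects defined above (the 0.5 threshold is an assumption to be tuned on validation pairs):

# Inference sketch: score a pair of images by embedding distance
model.eval()
with torch.no_grad():
    img1, img2, _ = siamese_dataset[0]  # any preprocessed image pair
    device = next(model.parameters()).device
    out1, out2 = model(img1.unsqueeze(0).to(device), img2.unsqueeze(0).to(device))
    distance = F.pairwise_distance(out1, out2).item()
    print("same digit" if distance < 0.5 else "different digits", f"(distance={distance:.3f})")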
Implementation Example in Natural Language Processing: Sentence Similarity with Sentence-BERT (SBERT)
Technologies Used:
- Hugging Face Transformers / Sentence-Transformers
- Pre-trained model: sentence-transformers/bert-base-nli-mean-tokens (the code below uses the lighter all-MiniLM-L6-v2)
- Cosine similarity evaluation
Overview of Implementation:
- Input any two sentences into BERT and obtain their sentence embedding vectors
- Evaluate semantic similarity using cosine similarity (range: −1 to 1, though typically 0 to 1 for sentence embeddings)
- A high score indicates similarity, while a low score indicates dissimilarity
from sentence_transformers import SentenceTransformer, util
# 1. Model loading (separate models are possible for Japanese and other languages)
model = SentenceTransformer('all-MiniLM-L6-v2') # Lightweight and fast
# 2. Prepare sentences for comparison
sentences = [
"This is a book about deep learning.",
"This book explains neural networks in detail.",
"I like to play soccer on weekends.",
]
# 3. Convert text to vectors
embeddings = model.encode(sentences, convert_to_tensor=True)
# 4. Calculate similarity (cosine similarity)
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)
# 5. Output the pairwise similarity matrix
print("Similarity matrix between sentences:")
for i in range(len(sentences)):
for j in range(len(sentences)):
print(f"({i}, {j}): {cosine_scores[i][j]:.4f}")
Medical Application Example: Similar Image Retrieval for Skin Lesions (ISIC Dataset)
Patient images are compared with known skin lesion images using a Siamese Network, allowing the system to present similar past cases.
This contributes to clinical decision support for physicians.
Feature extraction from medical images is typically performed using ResNet or EfficientNet-based backbones.
# Library installation (run in a shell):
#   pip install torch torchvision scikit-learn matplotlib
# Dataset definition
from torch.utils.data import Dataset
import os
from PIL import Image
import random
import torch
from torchvision import transforms
class ISICSiameseDataset(Dataset):
def __init__(self, image_dir, labels_dict, transform=None):
self.image_dir = image_dir
self.labels_dict = labels_dict # filename -> label
self.transform = transform or transforms.ToTensor()
self.image_filenames = list(labels_dict.keys())
def __getitem__(self, idx):
img1_name = self.image_filenames[idx]
label1 = self.labels_dict[img1_name]
        # Randomly decide whether to pick an image with the same or a different label
should_match = random.randint(0, 1)
while True:
img2_name = random.choice(self.image_filenames)
label2 = self.labels_dict[img2_name]
if (label1 == label2) == should_match:
break
img1 = Image.open(os.path.join(self.image_dir, img1_name)).convert('RGB')
img2 = Image.open(os.path.join(self.image_dir, img2_name)).convert('RGB')
        # Label convention matches the Contrastive Loss below: 0 = similar pair, 1 = dissimilar pair
        return self.transform(img1), self.transform(img2), torch.tensor([int(label1 != label2)], dtype=torch.float32)
def __len__(self):
return len(self.image_filenames)
# Model Definition
import torch.nn as nn
import torchvision.models as models
class SiameseResNet(nn.Module):
def __init__(self):
super().__init__()
        base_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained weights (current torchvision API)
base_model.fc = nn.Identity() # Remove classification layer
self.encoder = base_model
self.fc = nn.Sequential(
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 128)
)
def forward_once(self, x):
x = self.encoder(x)
x = self.fc(x)
return x
def forward(self, x1, x2):
out1 = self.forward_once(x1)
out2 = self.forward_once(x2)
return out1, out2
# Contrastive Loss
import torch.nn.functional as F
class ContrastiveLoss(nn.Module):
def __init__(self, margin=2.0):
super().__init__()
self.margin = margin
    def forward(self, out1, out2, label):
        # keepdim=True keeps the distance shaped [B, 1] so it broadcasts correctly with label
        distance = F.pairwise_distance(out1, out2, keepdim=True)
        loss = torch.mean((1 - label) * distance.pow(2) +
                          label * F.relu(self.margin - distance).pow(2))
        return loss
# Training loop
def train(model, dataloader, optimizer, criterion, epochs=5):
    device = next(model.parameters()).device  # use whichever device the model is on
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for x1, x2, label in dataloader:
            x1, x2, label = x1.to(device), x2.to(device), label.to(device)
out1, out2 = model(x1, x2)
loss = criterion(out1, out2, label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}: Loss {total_loss:.4f}")
# Inference for similar image retrieval
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def find_similar(query_img, dataset, model, top_k=5):
    device = next(model.parameters()).device
    model.eval()
    embeddings, paths = [], []
    with torch.no_grad():  # no gradients needed at inference time
        q_emb = model.forward_once(query_img.unsqueeze(0).to(device)).cpu().numpy()
        for img_path in dataset.image_filenames:
            img = Image.open(os.path.join(dataset.image_dir, img_path)).convert('RGB')
            img_tensor = dataset.transform(img).unsqueeze(0).to(device)
            emb = model.forward_once(img_tensor).cpu().numpy()
            embeddings.append(emb)
            paths.append(img_path)
similarities = cosine_similarity(q_emb, np.vstack(embeddings))[0]
top_indices = similarities.argsort()[::-1][:top_k]
return [paths[i] for i in top_indices], [similarities[i] for i in top_indices]
# Visualization of results (matplotlib)
import matplotlib.pyplot as plt
def show_similar(query_img, similar_imgs, sim_scores, dataset):
plt.figure(figsize=(15, 3))
plt.subplot(1, len(similar_imgs)+1, 1)
plt.imshow(query_img.permute(1, 2, 0))
plt.title("Query")
plt.axis('off')
for i, (img_path, score) in enumerate(zip(similar_imgs, sim_scores)):
img = Image.open(os.path.join(dataset.image_dir, img_path)).convert('RGB')
plt.subplot(1, len(similar_imgs)+1, i+2)
plt.imshow(img)
plt.title(f"Score: {score:.2f}")
plt.axis('off')
plt.show()
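A hedged end-to-end driver tying these pieces together; the image directory, filenames, and label dictionary below are placeholder assumptions:

# Example driver (paths and labels are placeholders)
from torch.utils.data import DataLoader

labels_dict = {"ISIC_0000001.jpg": 0, "ISIC_0000002.jpg": 1}  # filename -> lesion class
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = ISICSiameseDataset("data/isic", labels_dict, transform=transform)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SiameseResNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
train(model, loader, optimizer, ContrastiveLoss(), epochs=5)

query_img, _, _ = dataset[0]
paths, scores = find_similar(query_img, dataset, model, top_k=2)
show_similar(query_img, paths, scores, dataset)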
One-Shot Classification (Omniglot)
Siamese Networks also enable classification of previously unseen classes from a single example. This setting is typically evaluated on the Omniglot dataset, which contains handwritten characters from around 50 alphabets, and it is a representative use case of few-shot learning with the Siamese architecture.
Web Search & Recommendation Application: Similar Product Recommendation
The system compares products from a user's browsing history with the features of all available catalog items and recommends those whose vectors are most similar. The sketch below illustrates the idea with product titles and a Siamese-style BERT text encoder.
# Library installation (run in a shell):
#   pip install torch transformers scikit-learn
# Product catalog data
product_data = [
{"title": "Wireless Bluetooth Earphones with Noise Cancellation", "label": "audio"},
{"title": "Over-Ear Noise Cancelling Headphones", "label": "audio"},
{"title": "Smart Fitness Watch with Heart Rate Monitor", "label": "wearable"},
{"title": "Leather Smart Watch Band for Apple Watch", "label": "wearable"},
{"title": "Portable Bluetooth Speaker for Outdoors", "label": "audio"},
{"title": "Cotton Yoga Pants for Women", "label": "clothing"},
]
# Siamese pair generation (label 1 = same category, 0 = different category)
import random
def make_pairs(data):
pairs = []
for i in range(len(data)):
for j in range(i + 1, len(data)):
label = 1 if data[i]["label"] == data[j]["label"] else 0
pairs.append((data[i]["title"], data[j]["title"], label))
return pairs
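# Example: make_pairs yields labeled title pairs. A contrastive fine-tuning loop
# would consume these; the demo below only runs inference with a pre-trained encoder.
pairs = make_pairs(product_data)
for t1, t2, lbl in pairs[:3]:
    print(lbl, "|", t1, "<->", t2)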
from transformers import AutoTokenizer, AutoModel
import torch.nn as nn
import torch
class SiameseBERT(nn.Module):
def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
super().__init__()
self.bert = AutoModel.from_pretrained(model_name)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
def encode(self, sentences):
encoded = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
model_output = self.bert(**encoded)
        embeddings = model_output.last_hidden_state[:, 0, :]  # CLS token (mean pooling is the more common choice for this model)
return embeddings
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def recommend(query_text, product_data, model, top_k=3):
all_titles = [item["title"] for item in product_data]
embeddings = model.encode(all_titles)
query_emb = model.encode([query_text])
sims = cosine_similarity(query_emb, embeddings)[0]
    top_idx = sims.argsort()[::-1][:top_k]  # the query is not in the catalog, so there is no self-match to skip
return [(all_titles[i], sims[i]) for i in top_idx]
model = SiameseBERT()
query = "Compact Bluetooth Noise Cancelling Earbuds"
recommendations = recommend(query, product_data, model)
for title, score in recommendations:
print(f"{title} (Score: {score:.4f})")
Concrete Application Examples of Siamese Networks
1. Face Recognition and Authentication
FaceNet (Google) utilizes Triplet Loss to project a single face image into a high-dimensional embedding space, enabling person identification and clustering. This approach has achieved high accuracy in authentication and face search, and is used in smartphone face unlock systems and face clustering in Google Photos.
In contrast, DeepFace (Facebook) adopts an architecture close to a Siamese Network, focusing on determining whether two face images belong to the same individual. This technology is widely used for automatic face tagging in Facebook photos and for person identification and recommendation features.
2. Handwritten Signature Verification
SigNet (Dey et al., 2017) uses a CNN-based Siamese Network to verify the authenticity of handwritten signatures. By training on pairs of genuine and forged signatures with Contrastive Loss, it achieves high-accuracy signature verification, capturing fine-grained handwriting traits that identify individual writers.
This technology is increasingly used in secure document workflows in financial institutions and government agencies, such as verifying digital signatures, contract matching, and real-time signature authentication on tablets.
3. Medical Imaging and Diagnostic Support
In the ISIC Challenge for skin cancer, Siamese Networks are used to compute the similarity between a patient’s lesion image and known cancer cases, presenting visually similar cases to assist in diagnosis. This aids physicians in decision-making and provides a valuable second opinion for less experienced doctors.
In chest X-ray anomaly detection, patient X-rays are compared to historical patient databases to infer abnormalities or severity. Siamese architectures and metric learning capture subtle shadows or shape differences, enhancing diagnostic accuracy and reducing time.
4. Semantic Similarity in Natural Language Processing (NLP)
Sentence-BERT (SBERT) uses a Siamese structure with dual BERT encoders to generate fixed-length semantic vectors for each sentence. Cosine similarity between vectors is used to evaluate semantic closeness. SBERT enables efficient semantic search, sentence clustering, and FAQ matching, improving search precision and QA systems.
In the Quora Question Pairs (QQP) challenge, the task is to judge whether two questions have the same meaning. Siamese Networks are widely used for this, showing great performance in paraphrase detection and duplicate question identification.
5. Image Search and Recommendation
In e-commerce product search, systems recommend visually similar items by comparing user-selected product images with others. Platforms like Amazon and ZOZOTOWN use this for clothing and furniture, leveraging Siamese Networks to learn similarity and recommend based on visual similarity.
Pinterest’s Visual Search is a prime example, allowing users to select a region of an image and retrieve similar images based on composition, texture, and color. This uses a CNN-based Siamese architecture to embed images and rapidly search through hundreds of millions of images, enhancing recommendation accuracy.
6. Few-shot / One-shot Learning
Omniglot one-shot classification involves recognizing new handwritten characters using just one sample. Siamese Networks are standard for this task, learning to judge whether two character images belong to the same class based on distance in high-dimensional space—ideal for low-resource scenarios.
Mini-ImageNet classification deals with complex, natural images and few examples per class. Here, Siamese Networks are combined with meta-learning approaches and methods like Prototypical Networks that utilize class prototypes for robust few-shot learning. Siamese architectures provide a strong foundation for generalization in limited-data environments.
7. Speaker Recognition and Voice Authentication
Speaker verification systems use Siamese Networks to learn the similarity between different audio inputs to verify whether a voice matches a registered user. This technique powers VoiceID, Alexa, Google Home, and other voice assistants for user-specific personalization.
For example, Google Voice Match encodes users’ voice features using a Siamese encoder and matches them against stored profiles, enabling personalized responses even when multiple users share a single device.
8. General Biometrics (Fingerprint, Iris, Vein Recognition)
In fingerprint recognition, Siamese Networks extract feature vectors from fingerprint images and compare them via distance metrics. This is common in mobile biometric authentication (e.g., smartphones), where robustness to deformation or noise is key.
Iris recognition uses the pattern around the pupil for personal identification. High-security systems use Siamese Networks for rapid and accurate iris comparison, robust to occlusion or lighting changes.
9. Cybersecurity and Anomaly Detection
In network traffic pattern detection, Siamese Networks compare normal communication logs to current behavior to detect anomalies or attacks. By learning similarities over time-series or feature vectors, they identify previously unknown threats, surpassing rule-based detection systems.
In login behavior analysis, Siamese Networks compare current login patterns (time, IP, navigation) with historical user behavior to detect account hijacking or impersonation. Subtle deviations are captured via distance metrics, enabling detection of sophisticated threats and insider attacks.
10. Learning Support and Education
In similar question retrieval, systems automatically suggest structurally or semantically similar math or programming problems to students. Problems and code are vectorized, and Siamese Networks evaluate similarity to recommend personalized practice items based on student weaknesses.
For automated feedback generation, students’ answers are compared with model or previous answers using Siamese architectures to provide instant feedback or hints. This supports real-time individualized learning in chatbot-based tutoring or online education platforms.
Conclusion
Siamese Networks excel at "comparing unknown inputs to known references", making them well suited to domains with few samples per class, real-time identification requirements, or inputs that do not fit a fixed set of classes.
References for Siamese Networks
1. Original Works and Theoretical Foundations
- Bromley et al. (1993, AT&T Bell Labs)
  Title: Signature Verification using a "Siamese" Time Delay Neural Network
  Summary: The foundational work on Siamese Networks. Introduced the concept of weight sharing and applied it to handwritten signature verification for the first time.
- Hadsell, Chopra, LeCun (2006, CVPR)
  Title: Dimensionality Reduction by Learning an Invariant Mapping
  Summary: Proposed the Contrastive Loss function and established the theoretical basis of metric learning.
- Hoffer & Ailon (2015)
  Title: Deep Metric Learning using Triplet Network
  Summary: Extended similarity learning using the Triplet Network structure, influencing methods like FaceNet.
2. Applied and Advanced Architectures
- Schroff et al. (2015, Google)
  Title: FaceNet: A Unified Embedding for Face Recognition and Clustering
  Summary: Applied Triplet Loss to face recognition and clustering. A prominent extension of the Siamese architecture.
- Koch et al. (2015)
  Title: Siamese Neural Networks for One-shot Image Recognition
  Summary: Successfully applied Siamese Networks to one-shot classification using the Omniglot dataset. A pioneering work in few-shot learning.
- Vinyals et al. (2016, NIPS)
  Title: Matching Networks for One Shot Learning
  Summary: Combined attention mechanisms with similarity functions for one-shot learning. Closely related to Siamese structures.
- Snell et al. (2017, NIPS)
  Title: Prototypical Networks for Few-shot Learning
  Summary: Introduced few-shot classification based on the distance to class prototypes. A simplified alternative to Siamese Networks.
- Sung et al. (2018, CVPR)
  Title: Learning to Compare: Relation Network for Few-Shot Learning
  Summary: Learned the relationship between input pairs rather than relying on predefined distance metrics. An extension of the Siamese concept.
3. Implementation Resources and Practical Guides
- François Chollet
  Book: Deep Learning with Python (2nd Edition)
  Summary: Includes examples of implementing Siamese architectures and Contrastive Loss using Keras.
- Aurélien Géron
  Book: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Edition)
  Summary: Provides a detailed explanation of Siamese and Triplet learning and their applications.
- Kevin Musgrave
  GitHub: PyTorch Metric Learning
  Summary: A comprehensive library of metric learning implementations, including Siamese, Triplet, and ArcFace losses.
- Reimers & Gurevych (2019, EMNLP)
  Paper/Library: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (the Sentence-Transformers library)
  Summary: A library for evaluating semantic similarity using a Siamese BERT architecture. Supports multiple languages and is rich in practical implementations.
4. Benchmark Datasets for Experimentation
- Omniglot
  Use: Few-shot classification
  Summary: Contains 1,623 classes of handwritten characters. A standard benchmark for one-shot learning.
- LFW (Labeled Faces in the Wild)
  Use: Face recognition
  Summary: A face image pair classification dataset. Widely used to evaluate FaceNet and Siamese Networks.
- Quora Question Pairs (QQP)
  Use: Semantic similarity in NLP
  Summary: A binary classification task to determine whether two questions have the same meaning. Frequently used to evaluate models like SBERT.
- ISIC (Skin Lesion Dataset)
  Use: Medical image similarity learning
  Summary: A dataset for aiding diagnosis of skin lesions. Used to train models to learn visual similarity between medical images.