Overview of Siamese Networks, algorithms and implementation examples

Overview of Siamese Networks

A Siamese Network is a model architecture consisting of two (or more) identical neural networks arranged in parallel with shared weights. It is designed to learn and evaluate the similarity between inputs; it was originally developed for signature verification (Bromley et al., 1993) and later applied to tasks such as face recognition.

In a Siamese Network, each sub-network receives one of the inputs (e.g., image A and image B) and independently extracts a feature vector (embedding) from it. Since the weights are shared, both networks apply the same transformation to their respective inputs, ensuring fairness in the comparison.

The resulting feature vectors are then compared using a distance metric such as L1 distance (Manhattan distance), L2 distance (Euclidean distance), or cosine similarity, to compute the similarity between the two inputs.

To train the network based on this similarity, loss functions such as Contrastive Loss and Triplet Loss are commonly employed. Contrastive Loss encourages similar pairs to be closer together in the embedding space, while ensuring that dissimilar pairs are separated by a margin. Triplet Loss, on the other hand, uses three samples—Anchor, Positive, and Negative—and trains the network to maximize the distance between the Anchor and the Negative while minimizing the distance between the Anchor and the Positive.

In this way, Siamese Networks are specialized for learning a comparable feature space, and have been widely applied in tasks such as face recognition, signature verification, and similar image retrieval.

Related Algorithms

Siamese Networks are closely associated with similarity learning and metric learning. These models are based on the framework of measuring similarity between inputs in an embedding space, and numerous derived algorithms and applications have been proposed.

Classification by Loss Function

In similarity learning, various loss functions have been introduced to teach the network what counts as similar or dissimilar between data samples, typically expressed through distances or similarity scores between embeddings.

Contrastive Loss

Trains the network so that the distance between embeddings is small for similar input pairs and large for dissimilar pairs. This encourages data from the same class to be close together, and data from different classes to be well separated in the feature space.

Formula:
  L = (1 – y) * d² + y * max(0, margin – d)²
where d is the distance between two samples, and y is the label (y = 0 for similar, y = 1 for dissimilar).

Triplet Loss

Uses three inputs: an Anchor, a Positive (same class), and a Negative (different class). It trains the network so that the distance between the Anchor and Positive is smaller than that between the Anchor and Negative.

Formula:
  L = max(d(a, p) – d(a, n) + margin, 0)
where a = anchor, p = positive, n = negative.
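
As a quick illustration, PyTorch ships a built-in implementation of this loss; the sketch below applies it to random tensors standing in for the embeddings produced by a shared encoder (the batch size, embedding dimension, and margin are illustrative assumptions):

import torch
import torch.nn as nn

# Built-in triplet loss; p=2 selects the Euclidean distance
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

anchor   = torch.randn(8, 64)  # batch of 8 anchor embeddings
positive = torch.randn(8, 64)  # embeddings of same-class samples
negative = torch.randn(8, 64)  # embeddings of different-class samples

loss = triplet_loss(anchor, positive, negative)  # max(d(a,p) - d(a,n) + margin, 0), averaged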

Quadruplet Loss

Extends the Triplet Loss by adding a second Negative sample. Besides the usual anchor–positive–negative constraint, it requires the anchor–positive distance to be smaller than the distance between the two negatives, which further sharpens the decision boundary.

Formula:
  L = max(d(a, p) – d(a, n1) + margin₁, 0) + max(d(a, p) – d(n2, n1) + margin₂, 0)
where a = anchor, p = positive, and n1, n2 are two distinct negatives; margin₁ and margin₂ are hyperparameters controlling the strength of each constraint.
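
A minimal sketch of this loss in PyTorch, assuming precomputed embeddings (the function name and margin values are illustrative):

import torch.nn.functional as F

def quadruplet_loss(a, p, n1, n2, margin1=1.0, margin2=0.5):
    # a, p: anchor / positive embeddings; n1, n2: two distinct negatives
    d_ap  = F.pairwise_distance(a, p)
    d_an1 = F.pairwise_distance(a, n1)
    d_nn  = F.pairwise_distance(n2, n1)
    # First term: the ordinary triplet constraint.
    # Second term: the anchor-positive distance must also be smaller than the
    # distance between the two negatives, tightening the decision boundary.
    return (F.relu(d_ap - d_an1 + margin1) + F.relu(d_ap - d_nn + margin2)).mean()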

N-pair Loss

Handles one Anchor with multiple Negatives at once, enabling the network to learn discriminative features more efficiently across the mini-batch compared to contrastive or triplet-based methods.

Formula:
  L = log(1 + ∑ᵢ exp(fᵀfᵢ⁻ – fᵀf⁺))
where f is the anchor embedding, f⁺ the embedding of the positive sample, and fᵢ⁻ those of the N − 1 negatives.
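
The formula can be written down directly; a minimal sketch for a single anchor, assuming precomputed embeddings (tensor shapes are assumptions):

import torch

def n_pair_loss(anchor, positive, negatives):
    # anchor, positive: (D,) embeddings; negatives: (N-1, D)
    pos_score = anchor @ positive        # fᵀf⁺
    neg_scores = negatives @ anchor      # fᵀfᵢ⁻ for each negative
    return torch.log1p(torch.exp(neg_scores - pos_score).sum())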

Applications in Few-Shot Learning

Matching Networks

Leverages LSTM and attention mechanisms to classify a query sample based on its similarity to a support set (few labeled samples). It computes cosine similarity between the query and support set samples and predicts classes based on similarity weights. Like Siamese Networks, it relies on comparison-based classification.

Prototypical Networks

Computes a prototype for each class as the mean of the embeddings in the support set, then classifies queries based on their distance (typically Euclidean) to these prototypes. Unlike Siamese Networks which compare sample pairs, Prototypical Networks compare queries to class representatives—making it a simple yet powerful approach for few-shot classification.
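
A minimal sketch of the prototype computation and nearest-prototype classification (tensor shapes are assumptions):

import torch

def prototypical_classify(support_emb, support_labels, query_emb, n_classes):
    # support_emb: (S, D), support_labels: (S,), query_emb: (Q, D)
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0)  # one mean embedding per class
        for c in range(n_classes)
    ])                                                # (C, D)
    dists = torch.cdist(query_emb, prototypes)        # (Q, C) Euclidean distances
    return dists.argmin(dim=1)                        # nearest prototype wins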

Relation Networks

Instead of using a fixed distance metric, this method learns a relation module (a neural network) that outputs the similarity score between a query and support sample. This makes it more flexible than traditional Siamese approaches and suitable for capturing complex relationships such as visual or spatial similarity.
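
A minimal sketch of such a learned relation module (layer sizes are illustrative): instead of a fixed metric, a small MLP scores the concatenated query and support embeddings.

import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim * 2, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid()  # relation score in [0, 1]
        )

    def forward(self, query_emb, support_emb):
        # Concatenate the two embeddings and let the network judge similarity
        return self.mlp(torch.cat([query_emb, support_emb], dim=-1))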

Meta-Learning Approaches (e.g., MAML)

Siamese-based models can also be combined with meta-learning techniques such as Model-Agnostic Meta-Learning (MAML). While MAML focuses on rapid adaptation to new tasks with few steps, incorporating Siamese structures enhances the underlying representation learning, enabling faster and more generalizable adaptation.

Extension to NLP: Siamese BERT (SBERT)

In natural language processing, Siamese architectures have been extended into models like Siamese BERT (SBERT). SBERT uses two parallel BERT encoders to independently embed a pair of sentences. Their cosine similarity is then used to assess semantic closeness. SBERT has shown strong performance in semantic textual similarity (STS), retrieval, and ranking tasks.

Example of Application Implementations

Siamese Networks are used in a variety of fields, including vision, natural language processing, biometrics, medicine, and search. Examples of implementations are described below.

Example of image recognition application: Comparison of handwritten numbers using Siamese Networks (MNIST)

Technologies used:

  • TensorFlow/Keras or PyTorch (the example below uses PyTorch)

  • MNIST dataset

  • Contrastive Loss or Triplet Loss

Implementation Overview:

  • Create MNIST image pairs (positive pairs: same digit; negative pairs: different digits)

  • Build a CNN-based Siamese Network

  • Measure the distance between the two feature vectors with the L2 (Euclidean) distance

  • Use Contrastive Loss as the loss function

  • After training, measure the similarity between new images (see the inference sketch after the training script)

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader
import random
import numpy as np
from PIL import Image

# 1. Data preprocessing and pair generation
class SiameseMNIST(Dataset):
    def __init__(self, mnist_dataset):
        self.mnist_dataset = mnist_dataset
        self.targets = mnist_dataset.targets
        self.data = mnist_dataset.data

    def __getitem__(self, index):
        img1, label1 = self.data[index], int(self.targets[index])
        # Randomly decide whether to draw a same-label or a different-label partner
        should_get_same_class = random.randint(0, 1)
        while True:
            index2 = random.randint(0, len(self.data) - 1)
            label2 = int(self.targets[index2])
            if should_get_same_class == (label1 == label2):
                break
        img2 = self.data[index2]

        transform = transforms.Compose([transforms.ToPILImage(), transforms.ToTensor()])
        # Pair label follows the contrastive-loss convention above:
        # y = 0 for similar pairs, y = 1 for dissimilar pairs
        return (transform(img1), transform(img2), torch.tensor([int(label1 != label2)], dtype=torch.float32))

    def __len__(self):
        return len(self.mnist_dataset)

# 2. Siamese network definition
class SiameseNetwork(nn.Module):
    def __init__(self):
        super(SiameseNetwork, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2)
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 5 * 5, 128),
            nn.ReLU(),
            nn.Linear(128, 64)
        )

    def forward_once(self, x):
        x = self.cnn(x)
        x = x.view(x.size()[0], -1)
        return self.fc(x)

    def forward(self, input1, input2):
        output1 = self.forward_once(input1)
        output2 = self.forward_once(input2)
        return output1, output2

# 3. Contrastive Loss definition
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
        euclidean_distance = F.pairwise_distance(output1, output2)
        loss = torch.mean((1 - label) * torch.pow(euclidean_distance, 2) +
                          (label) * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2))
        return loss

# 4. Training
def train(model, dataloader, loss_fn, optimizer, device, epochs=5):
    for epoch in range(epochs):
        total_loss = 0
        for img1, img2, label in dataloader:
            img1, img2, label = img1.to(device), img2.to(device), label.to(device)
            output1, output2 = model(img1, img2)
            loss = loss_fn(output1, output2, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

# 5. Execution
if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    transform = transforms.ToTensor()
    mnist_train = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
    siamese_dataset = SiameseMNIST(mnist_train)
    dataloader = DataLoader(siamese_dataset, shuffle=True, batch_size=64)

    model = SiameseNetwork().to(device)
    loss_fn = ContrastiveLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    train(model, dataloader, loss_fn, optimizer, device, epochs=5)
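
After training, a pair of unseen images can be compared by embedding both with the shared encoder and thresholding their distance. The sketch below is a minimal example; the threshold of 0.5 is an illustrative assumption that would be tuned on held-out pairs in practice.

def predict_similarity(model, img1, img2, device, threshold=0.5):
    # Embed both images with the shared encoder and compare their distance
    model.eval()
    with torch.no_grad():
        emb1, emb2 = model(img1.unsqueeze(0).to(device), img2.unsqueeze(0).to(device))
        distance = F.pairwise_distance(emb1, emb2).item()
    # Small distance -> "same digit"; threshold=0.5 is an illustrative value
    return distance, distance < threshold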

Implementation Example in Natural Language Processing: Sentence Similarity with Sentence-BERT (SBERT)

Technologies Used:

  • Hugging Face Transformers

  • Pre-trained model: sentence-transformers/all-MiniLM-L6-v2 (used in the code below; the older sentence-transformers/bert-base-nli-mean-tokens is an alternative)

  • Cosine similarity evaluation

Overview of Implementation:

  • Input any two sentences into BERT and obtain their sentence embedding vectors

  • Evaluate semantic similarity using cosine similarity (range: −1 to 1)

  • A high score indicates similarity, while a low score indicates dissimilarity

from sentence_transformers import SentenceTransformer, util

# 1. Model loading (separate models are possible for Japanese and other languages)
model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight and fast

# 2. Prepare sentences for comparison
sentences = [
    "This is a book about deep learning.",
    "This book explains neural networks in detail.",
    "I like to play soccer on weekends.",
]

# 3. Convert text to vectors
embeddings = model.encode(sentences, convert_to_tensor=True)

# 4. Calculate similarity (cosine similarity)
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

# 5. output
print("Similarity matrix between sentences:")
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"({i}, {j}): {cosine_scores[i][j]:.4f}")

Medical Application Example: Similar Image Retrieval for Skin Lesions (ISIC Dataset)

Patient images are compared with known skin lesion images using a Siamese Network, allowing the system to present similar past cases.

This contributes to clinical decision support for physicians.

Feature extraction from medical images is typically performed using ResNet or EfficientNet-based backbones.

# library
pip install torch torchvision scikit-learn matplotlib

# data definition
from torch.utils.data import Dataset
import os
from PIL import Image
import random
import torch
from torchvision import transforms

class ISICSiameseDataset(Dataset):
    def __init__(self, image_dir, labels_dict, transform=None):
        self.image_dir = image_dir
        self.labels_dict = labels_dict  # filename -> label
        self.transform = transform or transforms.ToTensor()
        self.image_filenames = list(labels_dict.keys())

    def __getitem__(self, idx):
        img1_name = self.image_filenames[idx]
        label1 = self.labels_dict[img1_name]

        # Randomly decide whether to pick an image with the same or a different label
        should_match = random.randint(0, 1)
        while True:
            img2_name = random.choice(self.image_filenames)
            label2 = self.labels_dict[img2_name]
            if (label1 == label2) == should_match:
                break

        img1 = Image.open(os.path.join(self.image_dir, img1_name)).convert('RGB')
        img2 = Image.open(os.path.join(self.image_dir, img2_name)).convert('RGB')

        # y = 0 for similar pairs, y = 1 for dissimilar pairs (see the loss below)
        return self.transform(img1), self.transform(img2), torch.tensor([int(label1 != label2)], dtype=torch.float32)

    def __len__(self):
        return len(self.image_filenames)

# Model Definition
import torch.nn as nn
import torchvision.models as models

class SiameseResNet(nn.Module):
    def __init__(self):
        super().__init__()
        base_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
        base_model.fc = nn.Identity()  # Remove the classification layer; keep 512-d features
        self.encoder = base_model
        self.fc = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128)
        )

    def forward_once(self, x):
        x = self.encoder(x)
        x = self.fc(x)
        return x

    def forward(self, x1, x2):
        out1 = self.forward_once(x1)
        out2 = self.forward_once(x2)
        return out1, out2

# Contrastive Loss
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    def __init__(self, margin=2.0):
        super().__init__()
        self.margin = margin

    def forward(self, out1, out2, label):
        distance = F.pairwise_distance(out1, out2)
        loss = torch.mean((1 - label) * distance.pow(2) +
                          (label) * F.relu(self.margin - distance).pow(2))
        return loss

# Training loop
def train(model, dataloader, optimizer, criterion, epochs=5):
    device = next(model.parameters()).device  # run on whichever device the model is on
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for x1, x2, label in dataloader:
            x1, x2, label = x1.to(device), x2.to(device), label.to(device)
            out1, out2 = model(x1, x2)
            loss = criterion(out1, out2, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1}: Loss {total_loss:.4f}")

# Inference for similar image retrieval
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def find_similar(query_img, dataset, model, top_k=5):
    model.eval()
    device = next(model.parameters()).device  # use the model's device (CPU or GPU)
    with torch.no_grad():
        q_emb = model.forward_once(query_img.unsqueeze(0).to(device)).cpu().numpy()
        embeddings = []
        paths = []

        for img_path in dataset.image_filenames:
            img = Image.open(os.path.join(dataset.image_dir, img_path)).convert('RGB')
            img_tensor = dataset.transform(img).unsqueeze(0).to(device)
            emb = model.forward_once(img_tensor).cpu().numpy()
            embeddings.append(emb)
            paths.append(img_path)

        similarities = cosine_similarity(q_emb, np.vstack(embeddings))[0]
        top_indices = similarities.argsort()[::-1][:top_k]
        return [paths[i] for i in top_indices], [similarities[i] for i in top_indices]

# Visualization of results (matplotlib)
import matplotlib.pyplot as plt

def show_similar(query_img, similar_imgs, sim_scores, dataset):
    plt.figure(figsize=(15, 3))
    plt.subplot(1, len(similar_imgs)+1, 1)
    plt.imshow(query_img.permute(1, 2, 0))
    plt.title("Query")
    plt.axis('off')
    for i, (img_path, score) in enumerate(zip(similar_imgs, sim_scores)):
        img = Image.open(os.path.join(dataset.image_dir, img_path)).convert('RGB')
        plt.subplot(1, len(similar_imgs)+1, i+2)
        plt.imshow(img)
        plt.title(f"Score: {score:.2f}")
        plt.axis('off')
    plt.show()

Web Search & Recommendation Application: Similar Product Recommendation

The system compares product images from a user's browsing history with the image features of all available products, and recommends the items whose embedding vectors are most similar. (The implementation sketch further below uses product titles and text embeddings instead of images, but the retrieval logic is identical.)

One-Shot Classification (Omniglot)

Allows classification of previously unseen classes using only a single sample.

Often evaluated on the Omniglot dataset, which contains 1,623 handwritten characters drawn from 50 different alphabets.

A representative use case of few-shot learning using the Siamese architecture.
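
A minimal sketch of one-shot classification with a trained Siamese encoder (illustrative; it reuses the forward_once pattern from the examples above): the query is assigned the class of the support example whose embedding lies nearest.

import torch

def one_shot_classify(model, query_img, support_imgs, support_labels):
    # support_imgs: one example per candidate class (the "one shot")
    model.eval()
    with torch.no_grad():
        q = torch.cat([model.forward_once(query_img.unsqueeze(0))])       # (1, D)
        s = torch.cat([model.forward_once(img.unsqueeze(0))               # (N, D)
                       for img in support_imgs])
        distances = (s - q).norm(dim=1)                                   # L2 distance to each class
    return support_labels[distances.argmin().item()]

The code below implements the similar-product recommendation described above, using text embeddings of product titles: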

# Library Installation
pip install torch transformers scikit-learn

# data
product_data = [
    {"title": "Wireless Bluetooth Earphones with Noise Cancellation", "label": "audio"},
    {"title": "Over-Ear Noise Cancelling Headphones", "label": "audio"},
    {"title": "Smart Fitness Watch with Heart Rate Monitor", "label": "wearable"},
    {"title": "Leather Smart Watch Band for Apple Watch", "label": "wearable"},
    {"title": "Portable Bluetooth Speaker for Outdoors", "label": "audio"},
    {"title": "Cotton Yoga Pants for Women", "label": "clothing"},
]

# Siamese pair construction (useful for fine-tuning with pair labels;
# the retrieval demo below uses the pretrained encoder as-is)
import random

def make_pairs(data):
    pairs = []
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            label = 1 if data[i]["label"] == data[j]["label"] else 0
            pairs.append((data[i]["title"], data[j]["title"], label))
    return pairs

from transformers import AutoTokenizer, AutoModel
import torch.nn as nn
import torch

class SiameseBERT(nn.Module):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def encode(self, sentences):
        encoded = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            model_output = self.bert(**encoded)
        # Mean pooling over non-padding tokens (this checkpoint was trained
        # with mean pooling rather than the CLS token)
        mask = encoded["attention_mask"].unsqueeze(-1).float()
        embeddings = (model_output.last_hidden_state * mask).sum(1) / mask.sum(1)
        return embeddings

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend(query_text, product_data, model, top_k=3):
    all_titles = [item["title"] for item in product_data]
    embeddings = model.encode(all_titles)
    query_emb = model.encode([query_text])
    sims = cosine_similarity(query_emb, embeddings)[0]
    top_idx = sims.argsort()[::-1][:top_k]  # the query is not in the catalogue, so keep the top match
    return [(all_titles[i], sims[i]) for i in top_idx]

model = SiameseBERT()

query = "Compact Bluetooth Noise Cancelling Earbuds"
recommendations = recommend(query, product_data, model)

for title, score in recommendations:
    print(f"{title} (Score: {score:.4f})")

Concrete Application Examples of Siamese Networks

1. Face Recognition and Authentication

FaceNet (Google) utilizes Triplet Loss to map a face image into a compact 128-dimensional Euclidean embedding space, enabling person identification and clustering. This approach has achieved high accuracy in authentication and face search, and is used in smartphone face unlock systems and face clustering in Google Photos.

In contrast, DeepFace (Facebook) adopts an architecture close to a Siamese Network, focusing on determining whether two face images belong to the same individual. This technology is widely used for automatic face tagging in Facebook photos and for person identification and recommendation features.

2. Handwritten Signature Verification

SigNet (Dey et al., 2017) uses a CNN-based Siamese Network to verify the authenticity of handwritten signatures. By training on pairs of genuine and forged signatures with Contrastive Loss, it achieves high-accuracy, writer-independent signature verification, capturing fine handwriting traits.

This technology is increasingly used in secure document workflows in financial institutions and government agencies, such as verifying digital signatures, contract matching, and real-time signature authentication on tablets.

3. Medical Imaging and Diagnostic Support

In the ISIC Challenge for skin cancer, Siamese Networks are used to compute the similarity between a patient’s lesion image and known cancer cases, presenting visually similar cases to assist in diagnosis. This aids physicians in decision-making and provides a valuable second opinion for less experienced doctors.

In chest X-ray anomaly detection, patient X-rays are compared to historical patient databases to infer abnormalities or severity. Siamese architectures and metric learning capture subtle shadows or shape differences, enhancing diagnostic accuracy and reducing time.

4. Semantic Similarity in Natural Language Processing (NLP)

Sentence-BERT (SBERT) uses a Siamese structure with dual BERT encoders to generate fixed-length semantic vectors for each sentence. Cosine similarity between vectors is used to evaluate semantic closeness. SBERT enables efficient semantic search, sentence clustering, and FAQ matching, improving search precision and QA systems.

In the Quora Question Pairs (QQP) challenge, the task is to judge whether two questions have the same meaning. Siamese Networks are widely used for this, showing great performance in paraphrase detection and duplicate question identification.

5. Image Search and Recommendation

In e-commerce product search, systems recommend visually similar items by comparing user-selected product images with others. Platforms like Amazon and ZOZOTOWN use this for clothing and furniture, leveraging Siamese Networks to learn similarity and recommend based on visual similarity.

Pinterest’s Visual Search is a prime example, allowing users to select a region of an image and retrieve similar images based on composition, texture, and color. This uses a CNN-based Siamese architecture to embed images and rapidly search through hundreds of millions of images, enhancing recommendation accuracy.

6. Few-shot / One-shot Learning

Omniglot one-shot classification involves recognizing new handwritten characters using just one sample. Siamese Networks are standard for this task, learning to judge whether two character images belong to the same class based on distance in high-dimensional space—ideal for low-resource scenarios.

Mini-ImageNet classification deals with complex, natural images and few examples per class. Here, Siamese Networks are combined with meta-learning approaches and methods like Prototypical Networks that utilize class prototypes for robust few-shot learning. Siamese architectures provide a strong foundation for generalization in limited-data environments.

7. Speaker Recognition and Voice Authentication

Speaker verification systems use Siamese Networks to learn the similarity between different audio inputs to verify whether a voice matches a registered user. This technique powers VoiceID, Alexa, Google Home, and other voice assistants for user-specific personalization.

For example, Google Voice Match encodes users’ voice features using a Siamese encoder and matches them against stored profiles, enabling personalized responses even when multiple users share a single device.

8. General Biometrics (Fingerprint, Iris, Vein Recognition)

In fingerprint recognition, Siamese Networks extract feature vectors from fingerprint images and compare them via distance metrics. This is common in mobile biometric authentication (e.g., smartphones), where robustness to deformation or noise is key.

Iris recognition uses the pattern around the pupil for personal identification. High-security systems use Siamese Networks for rapid and accurate iris comparison, robust to occlusion or lighting changes.

9. Cybersecurity and Anomaly Detection

In network traffic pattern detection, Siamese Networks compare normal communication logs to current behavior to detect anomalies or attacks. By learning similarities over time-series or feature vectors, they identify previously unknown threats, surpassing rule-based detection systems.

In login behavior analysis, Siamese Networks compare current login patterns (time, IP, navigation) with historical user behavior to detect account hijacking or impersonation. Subtle deviations are captured via distance metrics, enabling detection of sophisticated threats and insider attacks.

10. Learning Support and Education

In similar question retrieval, systems automatically suggest structurally or semantically similar math or programming problems to students. Problems and code are vectorized, and Siamese Networks evaluate similarity to recommend personalized practice items based on student weaknesses.

For automated feedback generation, students’ answers are compared with model or previous answers using Siamese architectures to provide instant feedback or hints. This supports real-time individualized learning in chatbot-based tutoring or online education platforms.

Conclusion
Siamese Networks excel at “comparing unknown inputs to known references”, making them highly suitable for domains with few samples, real-time identification requirements, or inputs that cannot be covered by a fixed set of classes.

References for Siamese Networks

1. Original Works and Theoretical Foundations
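
  • Bromley et al. (1993, NIPS)
    Title: Signature Verification using a “Siamese” Time Delay Neural Network
    Summary: The original Siamese Network paper cited above; introduced weight-shared twin networks for signature verification.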

2. Applied and Advanced Architectures

  • Schroff et al. (2015, Google)
    Title: FaceNet: A Unified Embedding for Face Recognition and Clustering
    Summary: Applied Triplet Loss for face recognition and clustering. A prominent extension of the Siamese architecture.

  • Koch et al. (2015)
    Title: Siamese Neural Networks for One-shot Image Recognition
    Summary: Successfully applied Siamese Networks to one-shot classification using the Omniglot dataset. A pioneering work in few-shot learning.

  • Vinyals et al. (2016, NIPS)
    Title: Matching Networks for One-shot Learning
    Summary: Combined attention mechanisms with similarity functions for one-shot learning. Closely related to Siamese structures.

  • Snell et al. (2017, NeurIPS)
    Title: Prototypical Networks for Few-shot Learning
    Summary: Introduced few-shot classification based on the distance to class prototypes. A simplified alternative to Siamese Networks.

  • Sung et al. (2018, CVPR)
    Title: Learning to Compare: Relation Network for Few-Shot Learning
    Summary: Learned the relationship between input pairs rather than relying on predefined distance metrics. An extension of the Siamese concept.

3. Implementation Resources and Practical Guides

4. Benchmark Datasets for Experimentation

  • Omniglot
    Use: Few-shot classification
    Summary: Contains 1,623 classes of handwritten characters. A standard benchmark for one-shot learning.

  • LFW (Labeled Faces in the Wild)
    Use: Face recognition
    Summary: A face image pair classification dataset. Widely used to evaluate FaceNet and Siamese Networks.

  • Quora Question Pairs (QQP)
    Use: Semantic similarity in NLP
    Summary: A binary classification task to determine whether two questions have the same meaning. Frequently used to evaluate models like SBERT.

  • ISIC (Skin Lesion Dataset)
    Use: Medical image similarity learning
    Summary: A dataset for aiding diagnosis of skin lesions. Used to train models that learn visual similarity between medical images.
