Overview of self-supervised learning and various algorithms and implementation examples


Self-supervised learning

Self-supervised learning is a type of machine learning that can be regarded as a form of supervised learning in which the labels come from the data itself. Whereas supervised learning trains models on human-labeled data, self-supervised learning derives the training signal from the unlabeled data.

The idea of self-supervised learning is to train a model using the information and structure inside the data. Typical approaches are as follows.

Generating a teacher signal: Generate a teacher signal from the data and use it to train the model. For example, in the case of textual data, hiding parts of the data within a sentence and having the model predict the hidden parts can foster the ability to understand the context.
Autoregressive models: Build a model that takes some data as input and predicts subsequent data. In this way, the model learns patterns and structures in the data and acquires the ability to predict future data.
Generative Adversarial Network (GAN): The GAN described in “Overview of GANs and their various applications and implementations” is a network in which a generative model and a discriminative model compete with each other. The generative model generates data, the discriminative model attempts to distinguish between real data and generated data, and this competition allows the generative model to learn features of the real data.
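
The autoregressive setup in the list above can be sketched in a few lines. This is a minimal illustration rather than a full training pipeline: the function name `next_token_pairs` and the `context_size` parameter are assumptions made for this example. It shows how an unlabeled token sequence yields (input, target) pairs without any human labeling.

```python
def next_token_pairs(tokens, context_size):
    """Turn an unlabeled token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the model's input
        target = tokens[i]                    # the "label", taken from the data itself
        pairs.append((context, target))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

Every target is simply the next element of the sequence, which is why no external annotation is needed.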

Self-supervised learning is particularly useful for training models on data sets with few labels. For example, a model can first learn the features and structure of a large unlabeled data set and then be adjusted to a specific task with a small amount of labeled data.
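
The two-stage workflow described above can be sketched as follows. Everything here is an assumption for illustration: `encode` is a stand-in for a frozen, pretrained encoder (a real one would be a trained network), and a nearest-centroid classifier plays the role of the task-specific adjustment fitted on a handful of labeled samples.

```python
import math


def encode(x):
    """Stand-in for a frozen, pretrained encoder (hypothetical).
    Here it only rescales the input; a real encoder would be a trained network."""
    return [v / 10.0 for v in x]


def fit_centroids(features, labels):
    """Average the encoded features of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, feature):
    """Assign the class whose centroid is closest to the feature."""
    return min(centroids, key=lambda y: math.dist(centroids[y], feature))


# Only four labeled samples are used for the task-specific step.
labeled_x = [[1, 2], [2, 1], [9, 8], [8, 9]]
labeled_y = [0, 0, 1, 1]
centroids = fit_centroids([encode(x) for x in labeled_x], labeled_y)
prediction = predict(centroids, encode([7, 9]))
print(prediction)
```

The point of the design is that the encoder is never updated with labels; only the small classifier on top depends on them.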

Self-supervised learning has been applied to a variety of data modalities, including image, text, and audio, and its effectiveness has been widely studied, making this technique an important part of the modern machine learning evolution.

Algorithms used in self-supervised learning

Various algorithms exist for self-supervised learning. Some of the most common algorithms are described below.

Masked Language Model (MLM): This is a method for self-supervised learning on textual data and is used in BERT (Bidirectional Encoder Representations from Transformers) described in “BERT Overview, Algorithms, and Example Implementations”. Models are trained on tasks such as masking some words in a text and restoring the masked portions to the original words, which allows the model to acquire the ability to understand context and grammar.
Autoregressive Models: These are methods for sequential data (text, speech, time-series data, etc.) in which models are trained to predict the next data point from past information. Typical architectures include RNNs (Recurrent Neural Networks), described in “Overview of RNN and examples of algorithms and implementations”, LSTM (Long Short-Term Memory), described in “Overview of LSTM and Examples of Algorithms and Implementations”, and GRUs (Gated Recurrent Units), described in “Overview of GRUs and examples of algorithms and implementations”.
Contrastive Learning: Algorithms for image and text data that learn features by pulling together the representations of different views of the same sample and pushing apart the representations of different samples. Typical methods include SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) and MoCo (Momentum Contrast).
Generative Adversarial Networks (GANs): GANs used for image generation and other applications have a structure in which the generative model competes with the discriminative model. The generative model generates data that resembles the real data, while the discriminative model attempts to distinguish between the real data and the generated data. This process allows the generative model to learn features of the real data.
Spatial Autoencoders: A self-supervised learning method for image data in which a portion of the image is hidden as input data and the model is trained to reconstruct the hidden portion. This allows the model to improve its ability to capture features and structures in the image.
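
To make the contrastive idea in the list above concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is a simplified, one-directional variant written for illustration, not the exact loss of SimCLR or MoCo; the function names and the `temperature` default are assumptions. `view_a[i]` and `view_b[i]` are treated as embeddings of two views of the same sample, and every other `view_b[j]` acts as a negative.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean loss: view_a[i] should match view_b[i] and no other view_b[j]."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denominator - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned: {aligned:.3f}, shuffled: {shuffled:.3f}")
```

The loss is lower when matching views have similar embeddings, which is the signal that drives representation learning in this family of methods.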

Libraries and platforms used for self-supervised learning

Self-supervised learning is usually implemented with general-purpose deep learning libraries and platforms that are widely used in the machine learning community. Some representative libraries and platforms are described below.

PyTorch: PyTorch is an open source library for deep learning that is widely used to implement self-supervised learning methods.
TensorFlow: TensorFlow is an open source deep learning library developed by Google that also supports self-supervised learning methods.
Hugging Face Transformers: Hugging Face provides a library dedicated to natural language processing tasks. The Transformers library, named after the models described in “Overview of Transformer Models, Algorithms, and Examples of Implementations”, provides easy access to self-supervised learning models such as BERT and GPT, the latter described in “Overview of GPT and examples of algorithms and implementations”.
fastai: fastai is a high-level deep learning library based on PyTorch that provides utilities to easily implement self-supervised learning approaches.
Facebook AI Research’s Fairseq: Fairseq is a library developed by Facebook AI Research that specializes in sequence-to-sequence tasks and is also used for self-supervised learning tasks.
OpenAI’s CLIP: CLIP is a model from OpenAI that learns to relate images and text, and it can be used to link these two data modalities. It is often cited as an example of self-supervised learning.

Application of self-supervised learning

Self-supervised learning has been applied to a variety of data modalities and tasks. The following are examples of applications of self-supervised learning.

Natural Language Processing (NLP):
- Language modeling: learning context and language structure through the prediction of words and phrases in text.
- Text Embedding: can learn semantic embedding of words and sentences through tasks that hide parts of text and recover them from their context.
Image Processing:
- Reconstruction-based generative models: learn features and structures in an image through a task that hides parts of the image and reconstructs them. An example is an autoencoder that masks and restores parts of an image.
- Contrastive learning: learns image features by pulling together the representations of different views of the same image and pushing apart those of different images. This improves the ability to distinguish between similar images.
Speech Processing:
- Autoencoder: learns features of speech by masking and restoring parts of the speech data. This is useful for speech recognition and generation.
- Contrastive learning: extracts speech features by treating related speech segments (for example, different views of the same utterance) as the same class and unrelated segments as different classes.
Anomaly Detection:
- In anomaly detection, a self-supervised model is trained on normal data and then used to determine whether new data is anomalous. Because the model has only seen normal data, patterns it cannot reproduce well can be flagged as anomalous.
Semi-supervised learning:
- Self-supervised learning is also applied to semi-supervised learning, in which models are trained by combining unlabeled and labeled data. Supervisory signals generated from unlabeled data can be used to train the model.
Motion Recognition:
- Motion features can be extracted using self-supervised learning on motion data. This applies to sensor data, video data, etc.
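
The image masking task mentioned in the list above can be sketched without any deep learning library. This is an illustrative helper only: the name `mask_patches`, the patch size, and the mask ratio are assumptions. It hides a random subset of square patches and returns both the masked image (the model input) and a mask marking the pixels the model would have to reconstruct.

```python
import random


def mask_patches(image, patch, mask_ratio, rng):
    """Zero out a random subset of patch x patch blocks; return (masked image, mask)."""
    height, width = len(image), len(image[0])
    blocks = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    hidden = rng.sample(blocks, int(len(blocks) * mask_ratio))
    masked = [row[:] for row in image]
    mask = [[0] * width for _ in range(height)]
    for r0, c0 in hidden:
        for r in range(r0, min(r0 + patch, height)):
            for c in range(c0, min(c0 + patch, width)):
                masked[r][c] = 0.0
                mask[r][c] = 1  # 1 marks pixels the model must reconstruct
    return masked, mask


rng = random.Random(0)
image = [[float(r * 4 + c + 1) for c in range(4)] for r in range(4)]
masked, mask = mask_patches(image, patch=2, mask_ratio=0.5, rng=rng)
for row in masked:
    print(row)
```

The original image serves as the reconstruction target, so the training pair is (masked, image) and the loss is typically computed on the pixels where the mask is 1.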

These applications demonstrate the broad potential of self-supervised learning. The approach is particularly useful in situations where supervised learning is difficult, such as when labeled data is scarce or labels are expensive to obtain.

Example implementation of self-supervised learning applied to natural language processing

As an example of an implementation of self-supervised learning in natural language processing (NLP), we describe the BERT (Bidirectional Encoder Representations from Transformers) model for learning text embeddings. BERT is pre-trained in a self-supervised manner on a large text corpus, and its pre-trained features can be transferred to other NLP tasks (see “Overview of Transfer Learning, Algorithms, and Examples of Implementations”).

Below we show a simple example implementation for fine-tuning a BERT model using the Hugging Face Transformers library. This example is for a text classification task (binary classification).

import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in recent releases
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Data preparation (dummy data)
sentences = ["I like pizza.", "I hate broccoli.", "Pizza is delicious.", "Broccoli is gross."]
labels = [1, 0, 1, 0]  # 1: Positive, 0: Negative

# Data Division
train_sentences, val_sentences, train_labels, val_labels = train_test_split(sentences, labels, test_size=0.2, random_state=42)

# BERT tokenizer loading
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Data tokenization and encoding
train_encodings = tokenizer(train_sentences, truncation=True, padding=True)
val_encodings = tokenizer(val_sentences, truncation=True, padding=True)

# BERT model loading
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# data loader
train_dataset = torch.utils.data.TensorDataset(torch.tensor(train_encodings['input_ids']),
                                              torch.tensor(train_encodings['attention_mask']),
                                              torch.tensor(train_labels))
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)

# optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

# training loop
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids, attention_mask, batch_labels = batch
        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    
    # validation
    model.eval()
    with torch.no_grad():
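        # The tokenizer returned Python lists, so convert them to tensors first
        val_input_ids = torch.tensor(val_encodings['input_ids'])
        val_attention_mask = torch.tensor(val_encodings['attention_mask'])
        val_outputs = model(val_input_ids, attention_mask=val_attention_mask)
        val_predictions = torch.argmax(val_outputs.logits, dim=1)
        val_accuracy = accuracy_score(val_labels, val_predictions.tolist())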
        print(f"Epoch {epoch+1}: Validation Accuracy = {val_accuracy:.4f}")
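
The fine-tuning code above starts from a model that was already pre-trained with masked language modeling. The sketch below illustrates how such masked training examples can be built. It follows the commonly described BERT recipe (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged), but it works on plain string tokens instead of token IDs, and the function name `mask_tokens` is an assumption for this example.

```python
import random

MASK = "[MASK]"
IGNORE = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels) for masked language modeling."""
    inputs, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            inputs.append(token)
            labels.append(IGNORE)   # not selected: no loss at this position
            continue
        labels.append(token)        # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)                # usually replace with the mask token
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))   # sometimes replace with a random token
        else:
            inputs.append(token)               # sometimes keep the original token
    return inputs, labels


rng = random.Random(0)
tokens = "self supervised learning builds labels from the data itself".split()
vocab = sorted(set(tokens))
# mask_prob is raised to 0.5 here only so that the short demo shows some masking
inputs, labels = mask_tokens(tokens, vocab, rng, mask_prob=0.5)
for original, shown, label in zip(tokens, inputs, labels):
    print(f"{original:12s} {shown:12s} {label}")
```

Positions that were not selected carry the ignore label, so only the selected positions contribute to the loss.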

Example implementation applying self-supervised learning to image processing

As an example of self-supervised learning in image processing, we describe an image denoising autoencoder, a reconstruction-based model. In this example, a self-supervised model for image denoising is built for the handwritten digit images in the MNIST dataset.

An example implementation of the image denoising autoencoder using PyTorch is shown below.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt

# Data preprocessing and loading
transform = transforms.Compose([
    transforms.ToTensor()  # Keep the images clean; noise is added during training
])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)

# Definition of auto encoder
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU()
        )
        
        self.decoder = nn.Sequential(
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 28*28),
            nn.Sigmoid()  # Output values in the range of 0 to 1
        )
    
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

# Model Instantiation and Training
autoencoder = Autoencoder()
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder.parameters(), lr=0.001)

num_epochs = 10
for epoch in range(num_epochs):
    for data in train_loader:
        img, _ = data
        img = img.view(img.size(0), -1)
        noisy_img = img + 0.1 * torch.randn(img.size())  # add Gaussian noise to the input
        
        optimizer.zero_grad()
        output = autoencoder(noisy_img)
        loss = criterion(output, img)
        loss.backward()
        optimizer.step()
    
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

# Display of denoised images
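clean_imgs, _ = next(iter(train_loader))  # iterator.next() does not exist in Python 3
noisy_imgs = clean_imgs.view(clean_imgs.size(0), -1)
noisy_imgs = noisy_imgs + 0.1 * torch.randn(noisy_imgs.size())  # same noise level as in training
denoised_imgs = autoencoder(noisy_imgs)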
denoised_imgs = denoised_imgs.view(-1, 1, 28, 28)

plt.figure(figsize=(10, 4))
for i in range(10):
    plt.subplot(2, 10, i+1)
    plt.imshow(noisy_imgs[i].reshape(28, 28), cmap='gray')
    plt.axis('off')
    
    plt.subplot(2, 10, i+11)
    plt.imshow(denoised_imgs[i].detach().numpy().reshape(28, 28), cmap='gray')
    plt.axis('off')

plt.tight_layout()
plt.show()

In this example, noise is added to the handwritten digit images of the MNIST dataset and an autoencoder is trained to remove it. The autoencoder takes a noisy image as input and learns to restore the original image. The final plot compares the noisy inputs (top row) with the denoised outputs (bottom row).
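
The self-supervised part of the denoising setup is the construction of the training pair: the corrupted image is the input and the untouched image is the target. The following is a minimal sketch in plain Python, assuming pixel values in [0, 1]; the helper names and the noise level `sigma` are choices made for this example.

```python
import random


def add_gaussian_noise(pixels, sigma, rng):
    """Return a corrupted copy of `pixels`, clipped back into the valid range [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def mse(a, b):
    """Mean squared error between two equally long pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
clean = [0.0, 0.2, 0.5, 0.8, 1.0] * 4   # stand-in for a flattened image
noisy = add_gaussian_noise(clean, sigma=0.1, rng=rng)
print(f"MSE between noisy input and clean target: {mse(noisy, clean):.4f}")
```

A trained denoiser is useful when the error between its output and the clean target is lower than this baseline error of the noisy input.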

Example implementation applying self-supervised learning to speech processing

As an example of implementation of self-supervised learning in speech processing, we describe the implementation of a self-supervised speech generation model, WaveGAN, which is a GAN (Generative Adversarial Network) based method for generating speech waveforms.

Below is a simplified, WaveGAN-style example using PyTorch. In this example, the network generates waveforms from random noise vectors.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# hyperparameter
num_samples = 16000  # Samples per second
num_channels = 1  # monaural sound
latent_dim = 100  # Number of dimensions of the latent vector
# Generator Network
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        
        self.layers = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, num_channels, kernel_size=4, stride=2, padding=1),
            nn.Tanh()
        )
    
    def forward(self, z):
        return self.layers(z)

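# Noise generation: latent tensors of shape (batch, latent_dim, latent_len)
batch_size = 32
latent_len = 16  # the four stride-2 layers upsample 16 -> 256 samples

def generate_noise(batch_size, latent_dim, latent_len=latent_len):
    return torch.randn(batch_size, latent_dim, latent_len)

# Placeholder "real" data: sine waves with random frequency and phase.
# Replace this with batches of real speech waveforms in practice.
def sample_real(batch_size, length):
    t = torch.arange(length) / num_samples  # time in seconds
    freq = torch.empty(batch_size, 1).uniform_(200.0, 2000.0)
    phase = torch.empty(batch_size, 1).uniform_(0.0, 2 * np.pi)
    return torch.sin(2 * np.pi * freq * t + phase).unsqueeze(1)

# Discriminator Network
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        
        self.layers = nn.Sequential(
            nn.Conv1d(num_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=4, stride=2, padding=1)
        )
    
    def forward(self, x):
        return self.layers(x).mean(dim=(1, 2))  # one logit per waveform

# Generator and discriminator instantiation
generator = Generator()
discriminator = Discriminator()

# Loss Functions and Optimizers
criterion = nn.BCEWithLogitsLoss()
optimizer_g = optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizer_d = optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))

# learning loop
num_epochs = 200
for epoch in range(num_epochs):
    # Discriminator step: real -> 1, generated -> 0
    fake_samples = generator(generate_noise(batch_size, latent_dim))
    real_samples = sample_real(batch_size, fake_samples.size(-1))
    d_loss = (criterion(discriminator(real_samples), torch.ones(batch_size)) +
              criterion(discriminator(fake_samples.detach()), torch.zeros(batch_size)))
    optimizer_d.zero_grad()
    d_loss.backward()
    optimizer_d.step()
    
    # Generator step: make the discriminator classify generated data as real
    g_loss = criterion(discriminator(fake_samples), torch.ones(batch_size))
    optimizer_g.zero_grad()
    g_loss.backward()
    optimizer_g.step()
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], D Loss: {d_loss.item():.4f}, G Loss: {g_loss.item():.4f}")

# Display of generated speech waveforms
generated_noise = generate_noise(10, latent_dim)
generated_waveforms = generator(generated_noise).detach().numpy()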

plt.figure(figsize=(10, 4))
for i in range(10):
    plt.subplot(2, 10, i+1)
    plt.plot(generated_waveforms[i].squeeze())
    plt.axis('off')

plt.tight_layout()
plt.show()

This example generates waveforms from noise with a simplified WaveGAN-style network and plots the results. It is only a sketch: the layer configuration, the loss, and the training data should be adapted to the actual data and task. When applying self-supervised learning to speech processing, it is important to design the generative and speech feature extraction models carefully and to train them with appropriate data sets and hyperparameters.
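
The competition between the generative and discriminative models can be written down as two loss functions. The sketch below shows the standard binary cross-entropy GAN objective in plain Python, in the non-saturating form for the generator. It is given as a general illustration and is not necessarily the loss used by WaveGAN itself; the function names are assumptions for this example.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def discriminator_loss(real_logits, fake_logits):
    """The discriminator wants real samples scored as 1 and generated samples as 0."""
    real = [bce_with_logits(l, 1.0) for l in real_logits]
    fake = [bce_with_logits(l, 0.0) for l in fake_logits]
    return sum(real) / len(real) + sum(fake) / len(fake)


def generator_loss(fake_logits):
    """Non-saturating form: the generator wants its samples scored as real."""
    return sum(bce_with_logits(l, 1.0) for l in fake_logits) / len(fake_logits)


# A confident, correct discriminator has a low loss...
print(discriminator_loss(real_logits=[5.0], fake_logits=[-5.0]))
# ...while the generator's loss is high when its samples are recognized as fake.
print(generator_loss(fake_logits=[-5.0]))
```

The two losses pull the discriminator's outputs on generated data in opposite directions, which is what "compete" means in practice.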

Example implementation applying self-supervised learning for anomaly detection

As an example of one implementation for applying self-supervised learning to anomaly detection, we describe an Autoencoder-based method for detecting anomalous network traffic. In this example, the KDD Cup 1999 dataset is used to detect anomalies in network traffic.

Below is a simple example implementation of anomaly detection using an Autoencoder with PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

# Load data (KDD Cup 1999 data set)
data = torch.load('kddcup_data.pt')  # Data must be prepared in advance

# Data Standardization
scaler = StandardScaler()
data = scaler.fit_transform(data)

# Split into training and test data
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Definition of Autoencoder
class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Autoencoder, self).__init__()
        
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        
        self.decoder = nn.Sequential(
            # No ReLU on the output: standardized features can be negative
            nn.Linear(hidden_dim, input_dim)
        )
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

# hyperparameter
input_dim = train_data.shape[1]
hidden_dim = 16
learning_rate = 0.001
num_epochs = 20

# Model Instantiation and Training
model = Autoencoder(input_dim, hidden_dim)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    
    for data in train_data:
        optimizer.zero_grad()
        input_data = torch.tensor(data, dtype=torch.float32)
        output_data = model(input_data)
        loss = criterion(output_data, input_data)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {train_loss / len(train_data):.4f}")

# Anomaly detection with test data
model.eval()
with torch.no_grad():
    test_losses = []
    anomaly_scores = []
    for data in test_data:
        input_data = torch.tensor(data, dtype=torch.float32)
        output_data = model(input_data)
        test_loss = criterion(output_data, input_data)
        test_losses.append(test_loss.item())
        anomaly_scores.append(test_loss.item())
    
    threshold = torch.mean(torch.tensor(anomaly_scores)) + torch.std(torch.tensor(anomaly_scores))
    predicted_labels = [1 if score > threshold else 0 for score in anomaly_scores]

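# evaluation
# NOTE: true_labels is only a placeholder derived from the same threshold as the
# predictions, so the report below is trivially perfect. For a meaningful evaluation,
# replace it with the ground-truth normal/attack labels of the test samples.
true_labels = [1 if loss > threshold else 0 for loss in test_losses]
conf_matrix = confusion_matrix(true_labels, predicted_labels)
report = classification_report(true_labels, predicted_labels)

print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(report)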

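This example implementation uses the KDD Cup 1999 dataset to detect anomalies in network traffic. The Autoencoder learns to reconstruct the training traffic, and the reconstruction error on each test sample serves as its anomaly score; samples whose score exceeds a threshold are flagged as anomalous. For this to work as intended, the training data should consist mainly of normal traffic.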
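
The thresholding and evaluation steps can be illustrated separately from the network. In the sketch below the error values and labels are made up for illustration. Two design choices differ from the code above and are assumptions of this example: the threshold is fitted on reconstruction errors from normal training data, and the predictions are compared against ground-truth labels that come from the dataset.

```python
import statistics


def fit_threshold(normal_errors, k=3.0):
    """Threshold = mean + k standard deviations of errors on normal data."""
    return statistics.mean(normal_errors) + k * statistics.pstdev(normal_errors)


def precision_recall(true_labels, predicted_labels):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


train_errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13]  # errors on normal training data
test_errors = [0.11, 0.95, 0.10, 0.40, 0.12]         # errors on unseen data
test_labels = [0, 1, 0, 1, 0]                        # ground truth: 1 = anomaly

threshold = fit_threshold(train_errors)
predicted = [1 if e > threshold else 0 for e in test_errors]
print(predicted, precision_recall(test_labels, predicted))
```

Fitting the threshold on normal data keeps it independent of how many anomalies happen to be in the test set.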

Example implementation of self-supervised learning applied to semi-supervised learning

As an example of applying self-supervised learning to semi-supervised learning, we show how to introduce self-generated labels into a simple semi-supervised classification task. In this example, semi-supervised learning is used to classify the handwritten digit images in the MNIST dataset.

An example implementation of semi-supervised learning using PyTorch is shown below.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Data preprocessing and loading
transform = transforms.Compose([
    transforms.ToTensor()
])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
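# Keep 20% as labeled data and treat the remaining 80% as unlabeled
num_labeled = int(0.2 * len(train_dataset))
train_data, unlabeled_data = torch.utils.data.random_split(
    train_dataset, [num_labeled, len(train_dataset) - num_labeled],
    generator=torch.Generator().manual_seed(42))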

labeled_loader = torch.utils.data.DataLoader(dataset=train_data, batch_size=64, shuffle=True)
unlabeled_loader = torch.utils.data.DataLoader(dataset=unlabeled_data, batch_size=64, shuffle=True)

# Model Definition
class Classifier(nn.Module):
    def __init__(self):
        super(Classifier, self).__init__()
        
        self.layers = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.layers(x)

# Model instantiation and training (labeled data)
classifier = Classifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(classifier.parameters(), lr=0.001)

num_epochs = 10
for epoch in range(num_epochs):
    classifier.train()
    for data in labeled_loader:
        images, labels = data
        optimizer.zero_grad()
        outputs = classifier(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
    # Evaluation with labeled data
    classifier.eval()
    with torch.no_grad():
        all_labels = []
        all_predictions = []
        for data in labeled_loader:
            images, labels = data
            outputs = classifier(images)
            predictions = torch.argmax(outputs, dim=1)
            all_labels.extend(labels.tolist())
            all_predictions.extend(predictions.tolist())
        
        accuracy = accuracy_score(all_labels, all_predictions)
        print(f"Epoch {epoch+1}: Labeled Data Accuracy = {accuracy:.4f}")

# Self-supervised learning with unlabeled data
for epoch in range(num_epochs):
    classifier.train()  # switch back to training mode after each evaluation
    for data in unlabeled_loader:
        images, _ = data
        optimizer.zero_grad()
        outputs = classifier(images)
        pseudo_labels = torch.argmax(outputs, dim=1)  # Temporary labels in self-supervised learning
        loss = criterion(outputs, pseudo_labels)
        loss.backward()
        optimizer.step()

    # Evaluation on unlabeled data
    # (MNIST does provide true labels; they are hidden during training and used here only for evaluation)
    classifier.eval()
    with torch.no_grad():
        all_labels = []
        all_predictions = []
        for data in unlabeled_loader:
            images, labels = data
            outputs = classifier(images)
            predictions = torch.argmax(outputs, dim=1)
            all_labels.extend(labels.tolist())
            all_predictions.extend(predictions.tolist())
        
        accuracy = accuracy_score(all_labels, all_predictions)
        print(f"Epoch {epoch+1}: Unlabeled Data Accuracy = {accuracy:.4f}")

In this implementation example, labeled and unlabeled data are combined to perform the classification task. The labeled data is first used to train the classifier. The classifier's own predictions are then assigned to the unlabeled data as temporary labels (pseudo-labels), and the classifier is trained again on them.
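
A common refinement of pseudo-labeling, not used in the code above, is to keep only predictions the classifier is confident about, since wrong temporary labels can reinforce mistakes. The following is a minimal sketch in plain Python; the function name `select_pseudo_labels` and the threshold of 0.9 are assumptions for this example.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(batch_logits, threshold=0.9):
    """Return (index, label) pairs for predictions whose confidence reaches the threshold."""
    selected = []
    for index, logits in enumerate(batch_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence)))
    return selected


batch_logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 0.0, 5.0]]
print(select_pseudo_labels(batch_logits))  # [(0, 0), (2, 2)]
```

The second sample is skipped because its class probabilities are nearly uniform, so it would contribute an unreliable label.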

Implementation example of applying self-supervised learning to motion recognition

As an example of applying self-supervised learning to motion recognition, we describe an Autoencoder-based method for extracting motion features using accelerometer data. In this example, the “UCI HAR Dataset” provided in the UCI Machine Learning Repository is used for motion recognition.

Below is an example implementation of self-supervised learning for motion recognition using accelerometer data with PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix  # used in the evaluation below

# Loading data (UCI HAR Dataset)
data = torch.load('uci_har_data.pt')  # Data must be prepared in advance

# Data Preprocessing
scaler = StandardScaler()
data = scaler.fit_transform(data)

# Split into training and test data
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Definition of Autoencoder
class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Autoencoder, self).__init__()
        
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        
        self.decoder = nn.Sequential(
            # No ReLU on the output: standardized features can be negative
            nn.Linear(hidden_dim, input_dim)
        )
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

# hyperparameter
input_dim = train_data.shape[1]
hidden_dim = 16
learning_rate = 0.001
num_epochs = 20

# Model Instantiation and Training
model = Autoencoder(input_dim, hidden_dim)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    
    for data in train_data:
        optimizer.zero_grad()
        input_data = torch.tensor(data, dtype=torch.float32)
        output_data = model(input_data)
        loss = criterion(output_data, input_data)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {train_loss / len(train_data):.4f}")

# Anomaly detection with test data
model.eval()
with torch.no_grad():
    test_losses = []
    for data in test_data:
        input_data = torch.tensor(data, dtype=torch.float32)
        output_data = model(input_data)
        test_loss = criterion(output_data, input_data)
        test_losses.append(test_loss.item())
    
    threshold = torch.mean(torch.tensor(test_losses)) + torch.std(torch.tensor(test_losses))
    predicted_labels = [1 if loss > threshold else 0 for loss in test_losses]

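# evaluation
# NOTE: true_labels is only a placeholder derived from the same threshold as the
# predictions, so the accuracy below is trivially 1.0. For a meaningful evaluation,
# replace it with ground-truth labels for the test samples.
true_labels = [1 if loss > threshold else 0 for loss in test_losses]
conf_matrix = confusion_matrix(true_labels, predicted_labels)
accuracy = np.sum(np.diag(conf_matrix)) / np.sum(conf_matrix)

print("Confusion Matrix:")
print(conf_matrix)
print("\nAccuracy:", accuracy)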

In this implementation example, an Autoencoder is trained on accelerometer data to learn a compact representation of motion. The reconstruction error on the test data is then used as an anomaly score, and samples whose score exceeds a threshold are flagged as unusual motion.
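
The example above assumes the data has already been prepared as fixed-length feature vectors. When starting from a raw accelerometer stream, a common preprocessing step is to cut the signal into fixed-length, overlapping windows. The sketch below illustrates this; the function name and the window and step sizes are arbitrary choices for this example.

```python
def sliding_windows(signal, window_size, step):
    """Split a 1-D sequence of readings into fixed-length, possibly overlapping windows."""
    return [signal[start:start + window_size]
            for start in range(0, len(signal) - window_size + 1, step)]


readings = list(range(10))  # stand-in for one accelerometer axis
windows = sliding_windows(readings, window_size=4, step=2)
for window in windows:
    print(window)
```

Each window can then be flattened or summarized into a feature vector and passed to the Autoencoder as one training sample.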

Reference Information and Reference Books

See also “Small Data Learning, Combining Logic and Machine Learning, and Local/Group Learning” which discusses related approaches.

Self‑Supervised Learning: Teaching AI with Unlabeled Data by Robert Johnson (HiTeX Press, 2024) — Described as “a definitive guide to one of the most transformative developments in artificial intelligence… introduces readers to its principles and methodologies, which enable models to leverage vast amounts of unlabeled data effectively.”
A Cookbook of Self‑Supervised Learning
Introduction to Semi‑Supervised Learning — Although semi-supervised rather than self-supervised, it helps you understand label-scarce paradigms, which is contextually valuable.
Semi‑Supervised Learning — A deeper dive into semi-supervised methods; useful for contrast with SSL methods.
The Art of Self‑Directed Learning: 23 Tips for Giving Yourself an Unconventional Education — This one is not technical in the ML sense, but more about self-learning. Could be useful for meta-skills (how you learn the material).
Supervised Machine Learning with Python: Develop Rich Python Coding Practices While Exploring Supervised Machine Learning — Good to have as background if you’re familiar with supervised learning and now branching into SSL.

For reference books, see “Small Data Analysis and Machine Learning”.

“Data Analytics: A Small Data Approach”