Part 18 - Implementing Self-Training in Python
Machine Learning Algorithms Series - Semi-Supervised Learning with scikit-learn
This article explains how to implement self-training, a semi-supervised learning approach, in Python. Self-training leverages a small labeled dataset alongside a larger unlabeled dataset: the model is first trained on the labeled data and then makes predictions on the unlabeled data. Predictions made with high confidence are added, together with their pseudo-labels, to the labeled dataset, and the process repeats to improve the model. The article covers importing the necessary libraries, generating a synthetic dataset, splitting the data into labeled and unlabeled portions, initializing and training the model on the labeled data, performing self-training on the unlabeled data, and evaluating the model on a test set.
Step-by-Step Implementation
Importing Libraries:
Import RandomForestClassifier from sklearn.ensemble, make_classification from sklearn.datasets, train_test_split from sklearn.model_selection, accuracy_score from sklearn.metrics, and numpy as np for numerical operations.
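Collected in one place, the import block looks like this:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np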
Generating a Synthetic Dataset:
Use make_classification to generate a synthetic dataset with a specified number of samples, features, and a random state for reproducibility. For example:
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
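A quick sanity check (a minimal sketch; the print statements are illustrative) confirms the shape and class balance of the generated data:
# Inspect the generated dataset: 200 samples, 5 features, two classes
print(X.shape)         # (200, 5)
print(np.bincount(y))  # samples per class, roughly balanced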
Splitting the Data into Labeled and Unlabeled Portions:
Use train_test_split to split the dataset into labeled and unlabeled portions. For example, keep 30% of the data as labeled for initial training and treat the remaining 70% as unlabeled. Note that train_test_split returns four arrays, so the labels of the unlabeled portion are unpacked into a throwaway variable and set aside:
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X, y, test_size=0.7, random_state=42)
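With 200 samples and test_size=0.7, this leaves 60 labeled and 140 unlabeled examples, which can be verified directly (a minimal sketch):
# Verify the sizes of the labeled and unlabeled portions
print(X_labeled.shape, X_unlabeled.shape)  # (60, 5) (140, 5)
print(y_labeled.shape)                     # (60,)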
Initializing and Training the Model with Labeled Data:
Initialize a RandomForestClassifier with a random state for reproducibility, then train it on the initially labeled data:
model = RandomForestClassifier(random_state=42)
model.fit(X_labeled, y_labeled)
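Before self-training begins, it can be useful to see how confident this initial model is on the unlabeled pool (a minimal sketch; the variable names are illustrative, and 0.9 is the threshold used in the next step):
# Count unlabeled samples the initial model already predicts with high confidence
initial_probs = model.predict_proba(X_unlabeled)
n_confident = np.sum(np.max(initial_probs, axis=1) > 0.9)
print(f"{n_confident} of {len(X_unlabeled)} unlabeled samples exceed the 0.9 threshold")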
Performing Self-Training on Unlabeled Data:
Loop to repeat the self-training process multiple times. In each iteration:
Predict class probabilities for the unlabeled data using model.predict_proba(X_unlabeled).
Identify samples where the model's maximum predicted probability exceeds a threshold (e.g., 0.9), indicating high confidence.
Add these high-confidence samples from X_unlabeled to X_labeled, and append their predicted labels to y_labeled.
Remove the confident samples from X_unlabeled.
Retrain the model on the expanded labeled dataset (see the helper sketch after this list).
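These steps can be packaged as a small helper function. The sketch below is one way to organize them; the function name self_train and its signature are illustrative, not part of scikit-learn:
def self_train(model, X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_iter=5):
    """Iteratively pseudo-label high-confidence samples and retrain the model."""
    model.fit(X_labeled, y_labeled)
    for _ in range(max_iter):
        if len(X_unlabeled) == 0:
            break  # everything has been pseudo-labeled
        probs = model.predict_proba(X_unlabeled)
        confident = np.where(np.max(probs, axis=1) > threshold)[0]
        if len(confident) == 0:
            break  # no predictions clear the threshold
        # Move confident samples into the labeled set with their predicted labels
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.hstack([y_labeled, model.classes_[np.argmax(probs[confident], axis=1)]])
        X_unlabeled = np.delete(X_unlabeled, confident, axis=0)
        model.fit(X_labeled, y_labeled)
    return model
Calling self_train(RandomForestClassifier(random_state=42), X_labeled, y_labeled, X_unlabeled) reproduces the loop shown in the complete code example below.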
Final Evaluation on a Test Set:
Split the original data into training and test sets using train_test_split. Train the model on the full training set, predict labels for the test set, compute the accuracy with accuracy_score, and print the final score.
Complete Code Example
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Generate a synthetic dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# Split the data into labeled and unlabeled portions (the unlabeled labels are held out)
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X, y, test_size=0.7, random_state=42)
# Initialize and train the model with labeled data
model = RandomForestClassifier(random_state=42)
model.fit(X_labeled, y_labeled)
# Perform self-training on unlabeled data
for _ in range(5):
    if len(X_unlabeled) == 0:
        break  # every sample has already been pseudo-labeled
    probs = model.predict_proba(X_unlabeled)
    # Select rows whose highest class probability exceeds the 0.9 confidence threshold
    high_confidence_idx = np.where(np.max(probs, axis=1) > 0.9)[0]
    if len(high_confidence_idx) == 0:
        break  # no predictions clear the threshold
    # Append confident samples and their predicted labels to the labeled set
    X_labeled = np.vstack([X_labeled, X_unlabeled[high_confidence_idx]])
    y_labeled = np.hstack([y_labeled, model.classes_[np.argmax(probs[high_confidence_idx], axis=1)]])
    # Drop the newly labeled samples from the unlabeled pool and retrain
    X_unlabeled = np.delete(X_unlabeled, high_confidence_idx, axis=0)
    model.fit(X_labeled, y_labeled)
# Final evaluation on a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Conclusion
This article demonstrates how to implement self-training, a semi-supervised learning technique, using Python and scikit-learn. By iteratively adding confident predictions from the unlabeled data to the labeled dataset, the model can improve its performance with limited labeled data. This approach is particularly useful when labeled data is scarce and obtaining more labels is costly or time-consuming.