Part 18 - Implementing Self-Training in Python
Machine Learning Algorithms Series - Semi-Supervised Learning with scikit-learn
This article explains how to implement self-training, a semi-supervised learning approach, in Python. Self-training leverages a small labeled dataset alongside a larger unlabeled dataset: the model is first trained on the labeled data and then makes predictions on the unlabeled data. Predictions made with high confidence are added, together with their pseudo-labels, to the labeled dataset, and the process repeats to improve the model. The article covers importing the necessary libraries, generating a synthetic dataset, splitting the data into labeled and unlabeled portions, initializing and training the model on the labeled data, performing self-training on the unlabeled data, and evaluating the model on a test set.
Step-by-Step Implementation
Importing Libraries:
Import RandomForestClassifier from sklearn.ensemble, make_classification from sklearn.datasets, train_test_split from sklearn.model_selection, accuracy_score from sklearn.metrics, and numpy as np for numerical operations.
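Collected in one place, the import block looks like this:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np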
Generating a Synthetic Dataset:
Use make_classification to generate a synthetic dataset with a specified number of samples, features, and a random state for reproducibility. For example:
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
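A quick sanity check (a minimal sketch; the print statements are illustrative) confirms the shape and class balance of the generated data:
# Inspect the generated dataset: 200 samples, 5 features, two classes
print(X.shape)         # (200, 5)
print(np.bincount(y))  # samples per class, roughly balanced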
Splitting the Data into Labeled and Unlabeled Portions:
Use train_test_split to split the dataset into labeled and unlabeled portions. For example, keep 30% of the data as labeled for initial training and treat the remaining 70% as unlabeled. Note that train_test_split returns four arrays, so the labels of the unlabeled portion are unpacked into a throwaway variable and set aside:
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X, y, test_size=0.7, random_state=42)
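With 200 samples and test_size=0.7, this leaves 60 labeled and 140 unlabeled examples, which can be verified directly (a minimal sketch):
# Verify the sizes of the labeled and unlabeled portions
print(X_labeled.shape, X_unlabeled.shape)  # (60, 5) (140, 5)
print(y_labeled.shape)                     # (60,)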
Initializing and Training the Model with Labeled Data:
Initialize a RandomForestClassifier with a random state for reproducibility, then train it on the initially labeled data:
model = RandomForestClassifier(random_state=42)
model.fit(X_labeled, y_labeled)
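Before self-training begins, it can be useful to see how confident this initial model is on the unlabeled pool (a minimal sketch; the variable names are illustrative, and 0.9 is the threshold used in the next step):
# Count unlabeled samples the initial model already predicts with high confidence
initial_probs = model.predict_proba(X_unlabeled)
n_confident = np.sum(np.max(initial_probs, axis=1) > 0.9)
print(f"{n_confident} of {len(X_unlabeled)} unlabeled samples exceed the 0.9 threshold")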
Performing Self-Training on Unlabeled Data:
Loop to repeat the self-training process multiple times. In each iteration:
Predict class probabilities for the unlabeled data using model.predict_proba(X_unlabeled).
Identify samples where the model's maximum predicted probability exceeds a threshold (e.g., 0.9), indicating high confidence.
Add these high-confidence samples from X_unlabeled to X_labeled, and append their predicted labels to y_labeled.
Remove the confident samples from X_unlabeled.
Retrain the model on the expanded labeled dataset (see the helper sketch after this list).
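These steps can be packaged as a small helper function. The sketch below is one way to organize them; the function name self_train and its signature are illustrative, not part of scikit-learn:
def self_train(model, X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_iter=5):
    """Iteratively pseudo-label high-confidence samples and retrain the model."""
    model.fit(X_labeled, y_labeled)
    for _ in range(max_iter):
        if len(X_unlabeled) == 0:
            break  # everything has been pseudo-labeled
        probs = model.predict_proba(X_unlabeled)
        confident = np.where(np.max(probs, axis=1) > threshold)[0]
        if len(confident) == 0:
            break  # no predictions clear the threshold
        # Move confident samples into the labeled set with their predicted labels
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.hstack([y_labeled, model.classes_[np.argmax(probs[confident], axis=1)]])
        X_unlabeled = np.delete(X_unlabeled, confident, axis=0)
        model.fit(X_labeled, y_labeled)
    return model
Calling self_train(RandomForestClassifier(random_state=42), X_labeled, y_labeled, X_unlabeled) reproduces the loop shown in the complete code example below.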
Final Evaluation on a Test Set:
Split the original data into training and test sets using train_test_split. Train the model on the full training set, predict labels for the test set, compute the accuracy with accuracy_score, and print the final score.
Complete Code Example
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Generate a synthetic dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# Split the data into labeled and unlabeled portions (the unlabeled labels are held out)
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X, y, test_size=0.7, random_state=42)
# Initialize and train the model with labeled data
model = RandomForestClassifier(random_state=42)
model.fit(X_labeled, y_labeled)
# Perform self-training on unlabeled data
for _ in range(5):
    if len(X_unlabeled) == 0:
        break  # every sample has already been pseudo-labeled
    probs = model.predict_proba(X_unlabeled)
    # Select rows whose highest class probability exceeds the 0.9 confidence threshold
    high_confidence_idx = np.where(np.max(probs, axis=1) > 0.9)[0]
    if len(high_confidence_idx) == 0:
        break  # no predictions clear the threshold
    # Append confident samples and their predicted labels to the labeled set
    X_labeled = np.vstack([X_labeled, X_unlabeled[high_confidence_idx]])
    y_labeled = np.hstack([y_labeled, model.classes_[np.argmax(probs[high_confidence_idx], axis=1)]])
    # Drop the newly labeled samples from the unlabeled pool and retrain
    X_unlabeled = np.delete(X_unlabeled, high_confidence_idx, axis=0)
    model.fit(X_labeled, y_labeled)
# Final evaluation on a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Conclusion
This article demonstrates how to implement self-training, a semi-supervised learning technique, using Python and scikit-learn. By iteratively adding confident predictions from the unlabeled data to the labeled dataset, the model can improve its performance with limited labeled data. This approach is particularly useful when labeled data is scarce and obtaining more labels is costly or time-consuming.