Part 22 - One-Class SVM for Anomaly Detection
Machine Learning Algorithms Series - Implementing Anomaly Detection with One-Class SVM in Python
Continuing our series on machine learning algorithms, this article introduces the One-Class Support Vector Machine (SVM), a powerful technique for anomaly detection. Unlike traditional classification methods, One-Class SVM is designed for scenarios where the dataset contains examples of essentially a single class, and the goal is to identify outliers that deviate significantly from this norm. The algorithm separates the feature space into high-density regions (normal data) and sparse regions (anomalies), making it particularly useful when you want to flag data points that differ from the typical distribution of your data.
Understanding One-Class SVM
Anomaly Detection: One-Class SVM is primarily used for identifying data points that don't conform to the normal patterns in a dataset.
Single-Class Focus: It's designed for datasets where most data belongs to one class, and you're interested in finding outliers.
Density Separation: The algorithm separates data into high-density regions (normal data) and sparse regions (anomalies).
Applications: Useful in scenarios like fraud detection, identifying defective products, or detecting unusual activity in network traffic.
Step-by-Step Implementation
Import Libraries:
`sklearn.svm`: provides the `OneClassSVM` class for anomaly detection.
`numpy`: used for generating random data and manipulating arrays.
Generate Sample Data:
Create training data from a normal distribution using `np.random.randn`. This represents the "normal" data.
Create test data by combining normal data (similar to the training data) with uniformly distributed random points (outliers) using `np.random.uniform`. This simulates a real-world scenario where you have a mix of normal and anomalous data points.
Initialize and Train the Model:
Initialize the `OneClassSVM` model:
`gamma='auto'`: sets the kernel coefficient to 1 / n_features. This parameter controls the influence of each data point, with higher values leading to smaller decision regions.
`nu=0.1`: an upper bound on the fraction of training errors (anomalies) and a lower bound on the fraction of support vectors. It assumes up to 10% of your data might be anomalous.
Train the model using `model.fit(X_train)`. This step allows the model to learn the characteristics of the normal data.
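To see how `nu` behaves in practice, the sketch below (a small illustrative experiment, not part of the main example; data shape and seed are chosen arbitrarily) trains two models with different `nu` values and compares the fraction of training points each one flags as anomalous:

```python
from sklearn.svm import OneClassSVM
import numpy as np

# Fixed seed so the comparison is reproducible
rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(200, 2)  # 200 "normal" 2-D points

# A larger nu permits (and typically produces) more training
# points labelled -1, since nu upper-bounds the training-error fraction.
for nu in (0.05, 0.2):
    model = OneClassSVM(gamma='auto', nu=nu).fit(X_train)
    frac = np.mean(model.predict(X_train) == -1)
    print(f"nu={nu}: fraction flagged = {frac:.2f}")
```

Tuning `nu` is usually the main lever: set it close to the fraction of anomalies you expect in your data.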
Make Predictions:
Use the trained model to predict on the test data using `model.predict(X_test)`. The model assigns `-1` to points classified as anomalies and `1` to normal points.
Display Predictions:
Print the results to see which points are classified as normal (`1`) or anomalies (`-1`). This helps you understand how well the model is identifying outliers in your data.
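The raw array of `-1`/`1` labels is hard to read at a glance. A small sketch (mirroring the data setup used in this article, with a fixed seed added so the run is reproducible) that summarizes the predictions instead:

```python
from sklearn.svm import OneClassSVM
import numpy as np

rng = np.random.RandomState(0)  # seed added for reproducibility
X = 0.3 * rng.randn(100, 2)
X_train = np.vstack([X + 2, X - 2])
X_test = np.vstack([X + 2, X - 2, rng.uniform(-6, 6, size=(20, 2))])

model = OneClassSVM(gamma='auto', nu=0.1).fit(X_train)
predictions = model.predict(X_test)

# Count and extract the flagged points rather than eyeballing -1/1 labels
n_anomalies = np.sum(predictions == -1)
anomaly_points = X_test[predictions == -1]  # coordinates of flagged points
print(f"{n_anomalies} of {len(X_test)} test points flagged as anomalies")
```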
Complete Code Example:
from sklearn.svm import OneClassSVM
import numpy as np
# Generate sample data
X = 0.3 * np.random.randn(100, 2) # 100 normal data points
X_train = np.vstack([X + 2, X - 2]) # Training data: two clusters of normal points
# Generate test data with outliers
X_test = np.vstack([X + 2, X - 2, np.random.uniform(-6, 6, size=(20, 2))]) # Test data: normal points plus 20 outliers
# Initialize and train the model
model = OneClassSVM(gamma='auto', nu=0.1)
model.fit(X_train)
# Make predictions
predictions = model.predict(X_test)
# Display predictions
print(predictions)
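Beyond the hard `-1`/`1` labels, scikit-learn's `OneClassSVM` also exposes `decision_function`, which returns a signed distance to the learned boundary (negative for predicted anomalies). A sketch building on the example above (seed added so results are reproducible) that uses these scores to rank the most anomalous points:

```python
from sklearn.svm import OneClassSVM
import numpy as np

rng = np.random.RandomState(1)
X = 0.3 * rng.randn(100, 2)
X_train = np.vstack([X + 2, X - 2])
X_test = np.vstack([X + 2, X - 2, rng.uniform(-6, 6, size=(20, 2))])

model = OneClassSVM(gamma='auto', nu=0.1).fit(X_train)

# decision_function gives a signed distance to the boundary:
# negative values correspond to points predicted as anomalies (-1).
scores = model.decision_function(X_test)
most_anomalous = np.argsort(scores)[:5]  # indices of the 5 lowest scores
print("Most anomalous test points:", most_anomalous)
```

Ranking by score is often more useful than the binary labels, e.g. to review only the top-N most suspicious transactions in a fraud-detection pipeline.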
Conclusion
One-Class SVM provides an effective method for identifying anomalies when you primarily have data from a single class. By separating normal data from outliers, it's valuable in applications like fraud detection or identifying defective products. The implementation is straightforward, making it a practical choice for anomaly detection tasks.