Part 10 - Implementing Naive Bayes in Python

Machine Learning Algorithms Series - Classification with scikit-learn

Feb 06, 2025

This article explains how to implement a Gaussian Naive Bayes classifier in Python. It covers importing necessary libraries, preparing data, training the model, making predictions, and evaluating performance using accuracy scores and confusion matrices.

Introduction to Naive Bayes

The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem that assumes the features are conditionally independent given the class label. Although this "naive" assumption rarely holds exactly, the classifier often performs well in practice, for example in text classification and spam detection. Gaussian Naive Bayes is the variant typically used for continuous features; it assumes that, within each class, each feature follows a Gaussian (normal) distribution.
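As a compact way to state the idea above, the decision rule can be written as follows. The notation is an assumption introduced here for illustration: $c$ is a class label, $x_i$ is the $i$-th of $n$ features, and $\mu_{c,i}$ and $\sigma_{c,i}^2$ are the mean and variance of feature $i$ within class $c$.

$$
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)
$$

The product over features is where the conditional independence assumption enters: each feature contributes its own one-dimensional Gaussian likelihood.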

Step-by-Step Implementation

Importing Libraries:
- Import the GaussianNB class from sklearn.naive_bayes.
- Import train_test_split from sklearn.model_selection to split the dataset into training and testing sets.
- Import accuracy_score and confusion_matrix from sklearn.metrics to evaluate the model.
- Import numpy for numerical computations.
Preparing Data:
- Create a NumPy array X representing the hours studied and prior grades of students.
- Create a NumPy array y representing the binary outcomes (0 for fail, 1 for pass).
Splitting the Data:
- Use train_test_split to divide the data into training and testing sets.
- Specify test_size=0.2 to use 20% of the data for testing and 80% for training.
- Set random_state=42 for reproducibility.
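
To make the effect of these parameters concrete, here is a minimal sketch using placeholder data (the arrays are hypothetical and are not the article's dataset). It shows the resulting split sizes for 10 samples and that a fixed random_state gives the same split on every run. The stratify=y variation at the end is an optional extra, not part of the article's code, that keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 20% of 10 samples -> 2 test samples, 8 training samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

# Repeating the call with the same random_state reproduces the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Same split:", np.array_equal(X_test, X_test2))

# Optional variation: stratify on y so both classes appear in the test set
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Stratified test labels:", sorted(y_test_strat.tolist()))
```

With a dataset this small, the test set holds only two samples, so any score computed on it is a very rough estimate.
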
Initializing and Training the Model:
- Create an instance of the GaussianNB class.
- Train the model using the training data (X_train, y_train). During training, the model calculates the mean and variance of each feature for each class (pass/fail) to define the Gaussian distribution used for prediction.
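
The training step described above can be inspected directly. The sketch below fits GaussianNB on a small hypothetical dataset (not the article's data) and compares the learned per-class means in the model's theta_ attribute with means computed by hand in NumPy. Per-class variances are computed by hand as well; scikit-learn adds a small smoothing term (controlled by var_smoothing) to the variances it stores, so its values may differ very slightly from the manual ones.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: [hours studied, prior grade] -> 0 (fail) / 1 (pass)
X = np.array([
    [1.0, 50.0], [2.0, 55.0], [3.0, 60.0],
    [6.0, 75.0], [7.0, 80.0], [8.0, 90.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)

# Per-class feature means and variances computed by hand
manual_means = np.array([X[y == c].mean(axis=0) for c in model.classes_])
manual_vars = np.array([X[y == c].var(axis=0) for c in model.classes_])

print("Classes:", model.classes_)
print("Learned means (theta_):\n", model.theta_)
print("Manual means:\n", manual_means)
print("Manual variances:\n", manual_vars)
print("Class priors:", model.class_prior_)
```

Each row of theta_ corresponds to one class and each column to one feature, so the model stores one Gaussian per (class, feature) pair.
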
Making Predictions:
- Use the trained Gaussian Naive Bayes model to make predictions on the test data X_test.
- For each test sample, the model predicts the class, 0 (fail) or 1 (pass), that has the highest posterior probability.
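
To see the probabilities behind a prediction, the following sketch uses predict_proba on a hypothetical toy dataset (the data and the two new samples are made up for illustration). It checks that the label returned by predict matches the class with the largest probability.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: [hours studied, prior grade] -> 0 (fail) / 1 (pass)
X = np.array([
    [1.0, 50.0], [2.0, 55.0], [3.0, 60.0],
    [6.0, 75.0], [7.0, 80.0], [8.0, 90.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)

# Two hypothetical new students
X_new = np.array([[2.5, 58.0], [7.5, 85.0]])

# One row per sample, one column per class (ordered as in model.classes_)
proba = model.predict_proba(X_new)
pred = model.predict(X_new)

# Picking the most probable class by hand gives the same labels as predict
manual_pred = model.classes_[np.argmax(proba, axis=1)]

print("Class probabilities:\n", proba)
print("Predicted labels:", pred)
print("Argmax of probabilities:", manual_pred)
```
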
Evaluating the Model:
- Calculate the accuracy of the model by comparing the actual values y_test to the predicted values y_pred using accuracy_score.
- Compute the confusion matrix to understand true positives, true negatives, false positives, and false negatives.
- Print the accuracy and confusion matrix.
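
The layout of the confusion matrix is easy to misread, so here is a small sketch with hand-written labels (hypothetical values, not model output). In scikit-learn, rows are the actual classes and columns are the predicted classes, so for labels [0, 1] the flattened matrix reads true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted labels (0 = fail, 1 = pass)
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

# labels=[0, 1] forces a 2x2 matrix even if one class is missing from the inputs
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
acc = accuracy_score(y_true, y_hat)

print("Confusion matrix:\n", cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("Accuracy:", acc)
print("Accuracy from matrix:", (tp + tn) / cm.sum())
```

Passing labels explicitly is a design choice that matters for very small test sets, where a class may not appear at all and the matrix would otherwise shrink.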

Complete Code Example

# Import necessary libraries
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import numpy as np

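# Prepare data
# NOTE: the values below are illustrative placeholders, not real measurements.
# Each row of X is [hours studied, prior grade]; y is 0 (fail) or 1 (pass).
X = np.array([[1, 50], [2, 55], [3, 60], [4, 65], [5, 70],
              [6, 75], [7, 80], [8, 85], [9, 90], [10, 95]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])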

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
# Use a separate name so the imported confusion_matrix function is not overwritten
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)


Conclusion

This article demonstrates a complete workflow for using Gaussian Naive Bayes to classify students as passing or failing based on hours studied and prior grades. The model is evaluated with both an accuracy score and a confusion matrix, which together show not only how often it is correct but also which kinds of errors it makes.

School of AI | Newsletter
