Intrusion Detection System Project

Project Overview

This project implements an Intrusion Detection System (IDS) to detect potential network intrusions using anomaly detection. The system processes network traffic data (e.g., from Wireshark captures) and employs the Isolation Forest algorithm from Scikit-learn to identify anomalous behavior that may indicate intrusions, such as unauthorized access or malicious activities.

Objectives

Analyze network traffic data to extract relevant features (e.g., packet size, protocol type, source/destination IPs).
Apply the Isolation Forest algorithm to detect anomalies in network traffic.
Provide a modular system for batch intrusion detection that can later be extended to real-time processing.
Include comprehensive documentation and instructions for setup and usage.

Skills and Tools

Skills: Cybersecurity, Machine Learning, Networking
Tools: Python, Wireshark, Scikit-learn, Pandas, NumPy

System Design

The IDS consists of the following components:

Data Collection: Network traffic data is captured using Wireshark and preprocessed into a structured format (e.g., CSV).
Feature Extraction: Key features like packet size, protocol, source/destination IPs, and ports are extracted.
Anomaly Detection: The Isolation Forest algorithm identifies outliers in the feature set, flagging potential intrusions.
Reporting: Anomalous events are logged with details for further analysis.
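
As an illustration of the Feature Extraction component, IP addresses can be turned into numeric features with Python's standard-library ipaddress module. This is only a sketch of one possible approach, not the encoding used by ids.py; the function name ip_features and the two features it returns are assumptions made for the example.

```python
import ipaddress

def ip_features(addr):
    """Return simple numeric features for one IP address string (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    return {
        "ip_int": int(ip),                 # address as a plain integer
        "is_private": int(ip.is_private),  # 1 for private ranges such as 192.168.0.0/16
    }

print(ip_features("192.168.1.10"))  # {'ip_int': 3232235786, 'is_private': 1}
print(ip_features("8.8.8.8"))       # {'ip_int': 134744072, 'is_private': 0}
```

Unlike an arbitrary integer label, a flag such as is_private keeps some network meaning that the model may be able to use.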

Workflow

Input: CSV file containing network traffic data (e.g., from Wireshark).
Preprocessing: Handle missing values, encode categorical features (e.g., protocols, IPs), and normalize numerical data.
Model Training: Train an Isolation Forest model, ideally on a baseline of normal network traffic (the ids.py script fits the model on the input file itself).
Detection: Classify new traffic as normal or anomalous.
Output: Generate a report of detected anomalies with timestamps and details.
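
The train-on-baseline, then-detect steps above can be sketched as follows. The numbers are synthetic stand-ins for two numeric features (packet size and destination port), not real traffic, and the variable names are chosen only for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic baseline: typical packet sizes and two common destination ports.
rng = np.random.default_rng(0)
baseline = np.column_stack([
    rng.normal(500, 50, size=1000),
    rng.choice([80, 443], size=1000),
])
new_traffic = np.array([
    [510.0, 443.0],     # resembles the baseline
    [9000.0, 31337.0],  # far outside the baseline
])

# Fit the scaler and the model on baseline traffic only.
scaler = StandardScaler().fit(baseline)
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(scaler.transform(baseline))

# Reuse the fitted scaler for new traffic; predict returns 1 (normal) or -1 (anomaly).
labels = model.predict(scaler.transform(new_traffic))
print(labels)
```

Fitting on a baseline and calling predict on later captures means new traffic is judged against known-good behavior, so it does not shift the definition of normal.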

Requirements

Python 3.8+
Libraries: scikit-learn, pandas, numpy (argparse is part of the Python standard library)
Wireshark (for capturing network traffic)
Sample dataset (e.g., CSV with network traffic features)

Installation

Install Python: Ensure Python 3.8+ is installed.

Install Dependencies:

pip install scikit-learn pandas numpy

Install Wireshark: Download and install Wireshark from https://www.wireshark.org/.
Prepare Dataset: Export network traffic from Wireshark to CSV or use a sample dataset (see sample_data.csv example below).

Dataset

The system expects a CSV file with the following columns:

timestamp: Packet timestamp (e.g., ISO format or epoch).
src_ip: Source IP address.
dst_ip: Destination IP address.
protocol: Protocol type (e.g., TCP, UDP, ICMP).
packet_size: Packet size in bytes.
src_port: Source port number.
dst_port: Destination port number.
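
A small check like the one below can confirm that a file has these columns before it is passed to the IDS. It is a sketch only: the helper name missing_columns is hypothetical and is not part of ids.py.

```python
REQUIRED_COLUMNS = [
    "timestamp", "src_ip", "dst_ip", "protocol",
    "packet_size", "src_port", "dst_port",
]

def missing_columns(columns):
    """Return the required column names absent from `columns`, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

# Example: a header line that lacks the destination port column.
header = "timestamp,src_ip,dst_ip,protocol,packet_size,src_port".split(",")
print(missing_columns(header))  # ['dst_port']
```

With pandas, the same function can be called as missing_columns(df.columns) right after loading the CSV.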

Sample Dataset (sample_data.csv)

timestamp,src_ip,dst_ip,protocol,packet_size,src_port,dst_port
2025-08-17T10:00:00,192.168.1.10,192.168.1.1,TCP,150,49152,80
2025-08-17T10:00:01,192.168.1.11,8.8.8.8,UDP,68,53,53
...

Usage

Capture Network Traffic:
- Use Wireshark to capture network traffic.
- Export the capture to CSV with the required columns.

Run the IDS:

python ids.py --input sample_data.csv --output anomalies.csv

View Results:
- Check the anomalies.csv file for detected intrusions.
- Review the console output for a summary.
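
Wireshark's CSV export uses the packet-list column titles (by default No., Time, Source, Destination, Protocol, Length, Info), so the file usually needs renaming before it matches the schema above. The sketch below shows one way to do that with the standard library. The "Src Port" and "Dst Port" titles are assumptions: port columns are not shown by default and would have to be added in Wireshark under whatever title you choose. The function name convert_export is also hypothetical.

```python
import csv
import io

# Left: Wireshark column titles. Right: names expected by the IDS.
# "Src Port" and "Dst Port" are assumed titles for user-added columns.
COLUMN_MAP = {
    "Time": "timestamp",
    "Source": "src_ip",
    "Destination": "dst_ip",
    "Protocol": "protocol",
    "Length": "packet_size",
    "Src Port": "src_port",
    "Dst Port": "dst_port",
}

def convert_export(src, dst):
    """Copy rows from a Wireshark-style CSV into the schema described above."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(COLUMN_MAP.values()),
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Columns missing from the export are written as empty strings.
        writer.writerow({new: row.get(old, "") for old, new in COLUMN_MAP.items()})

# In-memory example standing in for an exported file.
export = io.StringIO(
    '"No.","Time","Source","Destination","Protocol","Length","Src Port","Dst Port","Info"\n'
    '"1","0.000000","192.168.1.10","192.168.1.1","TCP","150","49152","80","GET /"\n'
)
converted = io.StringIO()
convert_export(export, converted)
print(converted.getvalue())
```

The format of the Time column depends on Wireshark's time display setting, so the timestamp values may be relative seconds instead of ISO or epoch times.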

Code Implementation

ids.py

The main Python script for the IDS.

import argparse
import warnings

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder, StandardScaler

warnings.filterwarnings("ignore")


def preprocess_data(df):
    """Preprocess network traffic data.

    Returns the scaled feature matrix and the cleaned DataFrame. The
    DataFrame keeps its original (unencoded) values so the report stays
    human-readable.
    """
    # Handle missing values; copy so later assignments do not hit a view
    df = df.dropna().copy()

    # Select features for anomaly detection
    features = ['protocol', 'packet_size', 'src_port', 'dst_port', 'src_ip', 'dst_ip']
    X = df[features].copy()

    # Encode categorical features on the feature frame only
    for column in ['protocol', 'src_ip', 'dst_ip']:
        X[column] = LabelEncoder().fit_transform(X[column])

    # Normalize numerical features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return X_scaled, df


def detect_anomalies(X_scaled, contamination=0.01):
    """Detect anomalies using Isolation Forest."""
    model = IsolationForest(contamination=contamination, random_state=42)
    # fit_predict returns 1 for normal rows and -1 for anomalies
    predictions = model.fit_predict(X_scaled)
    return predictions


def main():
    # Parse command-line arguments
    parser = argparse.ArgumentParser(description="Intrusion Detection System")
    parser.add_argument('--input', type=str, required=True,
                        help="Input CSV file with network traffic data")
    parser.add_argument('--output', type=str, default='anomalies.csv',
                        help="Output CSV file for anomalies")
    args = parser.parse_args()

    # Load data
    try:
        df = pd.read_csv(args.input)
    except FileNotFoundError:
        print(f"Error: Input file {args.input} not found.")
        return

    # Preprocess data
    X_scaled, df = preprocess_data(df)

    # Detect anomalies
    predictions = detect_anomalies(X_scaled)

    # Add predictions to dataframe
    df['anomaly'] = predictions
    df['anomaly'] = df['anomaly'].map({1: 'Normal', -1: 'Anomaly'})

    # Save anomalies to output file
    anomalies = df[df['anomaly'] == 'Anomaly']
    anomalies.to_csv(args.output, index=False)

    # Print summary
    print(f"Total packets analyzed: {len(df)}")
    print(f"Anomalies detected: {len(anomalies)}")
    print(f"Anomalies saved to: {args.output}")


if __name__ == "__main__":
    main()

Code Explanation

Preprocessing: The preprocess_data function handles missing values, encodes categorical features (e.g., protocol, IPs) using LabelEncoder, and normalizes numerical features using StandardScaler.
Anomaly Detection: The detect_anomalies function uses the Isolation Forest algorithm with a contamination parameter (default: 0.01) to identify outliers.
Main Function: Parses command-line arguments, loads the input CSV, processes the data, detects anomalies, and saves results to an output CSV.
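
The contamination parameter sets the expected fraction of anomalies, which in turn fixes the score threshold behind the 1 / -1 labels. The sketch below, on synthetic data with five hand-placed outliers, shows how the underlying scores from decision_function could be used to rank rows in place of the label alone. It is an illustration only and not part of ids.py.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 495 ordinary points plus 5 hand-placed outliers (rows 0-4).
rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(500, 2))
X[:5] = [[10, 10], [-10, 10], [10, -10], [-10, -10], [0, 15]]

model = IsolationForest(contamination=0.01, random_state=42).fit(X)
labels = model.predict(X)            # 1 = normal, -1 = anomaly
scores = model.decision_function(X)  # lower is more anomalous; below 0 maps to -1

# Rank rows from most to least anomalous.
ranked = np.argsort(scores)
print("flagged rows:", np.flatnonzero(labels == -1))
print("most anomalous rows:", ranked[:5])
```

Writing the score next to each flagged packet would let an analyst review the most unusual events first, and would make it easier to judge whether the chosen contamination value is too high or too low.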

Running the System

Save the sample dataset as sample_data.csv.
Run the script:

python ids.py --input sample_data.csv --output anomalies.csv

Check the anomalies.csv file for detected intrusions.

Example Output (anomalies.csv)

timestamp,src_ip,dst_ip,protocol,packet_size,src_port,dst_port,anomaly
2025-08-17T10:00:05,192.168.1.10,10.0.0.1,TCP,1500,12345,23,Anomaly
...

Testing

Unit Testing: Test the preprocessing and anomaly detection functions with synthetic data.
Validation: Use a labeled dataset (e.g., NSL-KDD) to evaluate detection quality with metrics such as precision and recall, since accuracy alone is misleading when attacks are rare.
Real-World Testing: Capture live traffic with Wireshark and test the system in a controlled environment.
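
A minimal sketch of the synthetic-data test and labeled validation described above. The data is randomly generated with known attack rows, so it says nothing about performance on real traffic; the array names and the 2% attack rate are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# 980 synthetic "normal" rows and 20 synthetic "attack" rows far from the origin.
rng = np.random.default_rng(7)
normal = rng.normal(0, 1, size=(980, 3))
attacks = rng.uniform(6, 10, size=(20, 3)) * rng.choice([-1, 1], size=(20, 3))
X = np.vstack([normal, attacks])
y_true = np.array([0] * 980 + [1] * 20)  # 1 = attack

# Isolation Forest reports -1 for anomalies; convert to the 0/1 label convention.
pred = IsolationForest(contamination=0.02, random_state=42).fit_predict(X)
y_pred = (pred == -1).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same pattern applies to a real labeled dataset once its columns have been mapped to the features the model expects.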

Limitations

False Positives: The Isolation Forest may flag legitimate but rare traffic as anomalies.
Feature Selection: The system relies on predefined features; additional features (e.g., packet frequency) may improve accuracy.
Real-Time Processing: The current implementation is batch-based; real-time processing would require integration with a packet capture library like pcapy.
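
As an example of the packet frequency feature mentioned above, the sketch below counts packets per source IP in one-second windows with pandas. The column name pkts_per_sec and the window length are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-08-17T10:00:00", "2025-08-17T10:00:00",
        "2025-08-17T10:00:00", "2025-08-17T10:00:01",
    ]),
    "src_ip": ["192.168.1.10", "192.168.1.10", "192.168.1.11", "192.168.1.10"],
})

# Bucket timestamps into one-second windows, then count packets
# per (source IP, window) pair and attach the count to every row.
window = df["timestamp"].dt.floor("1s").rename("window")
df["pkts_per_sec"] = df.groupby(["src_ip", window])["timestamp"].transform("count")
print(df["pkts_per_sec"].tolist())  # [2, 2, 1, 1]
```

The new column could then be appended to the feature list before scaling.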

Future Improvements

Integrate real-time packet capture using scapy or pyshark.
Add more sophisticated features (e.g., packet inter-arrival time, entropy).
Implement a hybrid model combining supervised and unsupervised learning.
Develop a GUI for visualizing anomalies in real-time.
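
One possible reading of the entropy feature listed above is the Shannon entropy of the destination ports contacted by a source: a host talking to a single service has low entropy, while a port scan has high entropy. The sketch below is an assumption about how such a feature might be computed, using only the standard library.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # p * log2(1/p) summed over distinct values
    return sum((c / total) * math.log2(total / c) for c in counts.values())

steady = [443, 443, 443, 443]  # one destination port
scan = [21, 22, 23, 80]        # four different destination ports
print(shannon_entropy(steady))  # 0.0
print(shannon_entropy(scan))    # 2.0
```

Computed per source IP over a time window, this value could be added as one more numeric feature for the model.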

References

Scikit-learn Documentation: https://scikit-learn.org/
Wireshark User Guide: https://www.wireshark.org/docs/
Isolation Forest: Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." IEEE International Conference on Data Mining.

This project provides a complete, working IDS with documentation and code. You can extend it by adding real-time capture capabilities or integrating more advanced machine learning models.
