Intrusion Detection System Project
Project Overview
This project implements an Intrusion Detection System (IDS) to detect potential network intrusions using anomaly detection. The system processes network traffic data (e.g., from Wireshark captures) and employs the Isolation Forest algorithm from Scikit-learn to identify anomalous behavior that may indicate intrusions, such as unauthorized access or malicious activities.
Objectives
Analyze network traffic data to extract relevant features (e.g., packet size, protocol type, source/destination IPs).
Apply the Isolation Forest algorithm to detect anomalies in network traffic.
Provide a scalable and modular system for real-time or batch intrusion detection.
Include comprehensive documentation and instructions for setup and usage.
Skills and Tools
Skills: Cybersecurity, Machine Learning, Networking
Tools: Python, Wireshark, Scikit-learn, Pandas, NumPy
System Design
The IDS consists of the following components:
Data Collection: Network traffic data is captured using Wireshark and preprocessed into a structured format (e.g., CSV).
Feature Extraction: Key features like packet size, protocol, source/destination IPs, and ports are extracted.
Anomaly Detection: The Isolation Forest algorithm identifies outliers in the feature set, flagging potential intrusions.
Reporting: Anomalous events are logged with details for further analysis.
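The feature-extraction step has to turn IP addresses into numbers. One possible approach, sketched below with Python's standard ipaddress module, derives an integer value and a private-range flag per address. The helper name ip_features and the output column names are hypothetical and not part of ids.py; the sketch assumes IPv4 addresses.

```python
import ipaddress

import pandas as pd


def ip_features(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return numeric features derived from an IPv4 address column (illustrative helper)."""
    parsed = df[column].map(ipaddress.ip_address)
    return pd.DataFrame(
        {
            # Integer form keeps addresses from the same subnet numerically close.
            f"{column}_int": parsed.map(int),
            # 1 if the address is in a private range, else 0.
            f"{column}_is_private": parsed.map(lambda ip: int(ip.is_private)),
        },
        index=df.index,
    )


sample = pd.DataFrame({"src_ip": ["192.168.1.10", "8.8.8.8"]})
features = ip_features(sample, "src_ip")
print(features)
```

Unlike an arbitrary label, these values do not depend on which addresses happened to appear in a given capture, so the same address maps to the same feature across runs.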
Workflow
Input: CSV file containing network traffic data (e.g., from Wireshark).
Preprocessing: Handle missing values, encode categorical features (e.g., protocols, IPs), and normalize numerical data.
Model Training: Train an Isolation Forest model on a baseline of normal network traffic.
Detection: Classify new traffic as normal or anomalous.
Output: Generate a report of detected anomalies with timestamps and details.
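The training and detection steps above can be sketched as two separate calls: fit on a baseline, then predict on traffic the model has not seen. This is a minimal illustration on synthetic numbers, not the code in ids.py; the feature values and the use of a scikit-learn pipeline are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a baseline of normal traffic (values are made up).
# Columns: packet_size, src_port, dst_port.
baseline = np.column_stack([
    rng.normal(500, 50, 1000),         # packet sizes around 500 bytes
    rng.integers(49152, 65535, 1000),  # ephemeral source ports
    rng.choice([80, 443], 1000),       # web destination ports
])

# Fit the scaler and the forest on the baseline only.
model = make_pipeline(
    StandardScaler(),
    IsolationForest(contamination=0.01, random_state=42),
)
model.fit(baseline)

# Score unseen traffic: one packet that resembles the baseline, one that does not.
new_traffic = np.array([
    [510, 50000, 443],  # typical web request
    [9000, 12345, 23],  # oversized packet to the Telnet port
])
labels = model.predict(new_traffic)  # 1 = normal, -1 = anomaly
print(labels)
```

Keeping the scaler inside the pipeline means new traffic is normalized with the statistics learned from the baseline rather than with its own.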
Requirements
Python 3.8+
Libraries: scikit-learn, pandas, numpy (argparse is part of the Python standard library and needs no installation)
Wireshark (for capturing network traffic)
Sample dataset (e.g., CSV with network traffic features)
Installation
Install Python: Ensure Python 3.8+ is installed.
Install Dependencies:
pip install scikit-learn pandas numpy
Install Wireshark: Download and install Wireshark from https://www.wireshark.org/.
Prepare Dataset: Export network traffic from Wireshark to CSV or use a sample dataset (see sample_data.csv example below).
Dataset
The system expects a CSV file with the following columns:
timestamp: Packet timestamp (e.g., ISO format or epoch).
src_ip: Source IP address.
dst_ip: Destination IP address.
protocol: Protocol type (e.g., TCP, UDP, ICMP).
packet_size: Packet size in bytes.
src_port: Source port number.
dst_port: Destination port number.
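Since the script depends on these exact column names, it can help to check an input file before processing it. The sketch below is one way to do that; the names REQUIRED_COLUMNS and missing_columns are hypothetical and do not appear in ids.py.

```python
import pandas as pd

# Column names taken from the list above.
REQUIRED_COLUMNS = [
    "timestamp", "src_ip", "dst_ip", "protocol",
    "packet_size", "src_port", "dst_port",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df (illustrative helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


good = pd.DataFrame(columns=REQUIRED_COLUMNS)
bad = good.drop(columns=["dst_port"])
print(missing_columns(good))
print(missing_columns(bad))
```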
Sample Dataset (sample_data.csv)
timestamp,src_ip,dst_ip,protocol,packet_size,src_port,dst_port
2025-08-17T10:00:00,192.168.1.10,192.168.1.1,TCP,150,49152,80
2025-08-17T10:00:01,192.168.1.11,8.8.8.8,UDP,68,53,53
...
Usage
Capture Network Traffic:
Use Wireshark to capture network traffic.
Export the capture to CSV with the required columns.
Run the IDS:
python ids.py --input sample_data.csv --output anomalies.csv
View Results:
Check the anomalies.csv file for detected intrusions.
Review the console output for a summary.
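When reviewing results, a quick grouping of the flagged rows can show which hosts and ports dominate. The sketch below assumes the anomalies.csv layout described in this document; the in-memory CSV stands in for the real file, and every row after the first is invented for illustration.

```python
import io

import pandas as pd

# Stand-in for anomalies.csv (illustrative rows, not real results).
csv_text = """timestamp,src_ip,dst_ip,protocol,packet_size,src_port,dst_port,anomaly
2025-08-17T10:00:05,192.168.1.10,10.0.0.1,TCP,1500,12345,23,Anomaly
2025-08-17T10:00:09,192.168.1.10,10.0.0.2,TCP,1500,12346,23,Anomaly
2025-08-17T10:01:00,192.168.1.12,8.8.8.8,UDP,900,5353,53,Anomaly
"""
anomalies = pd.read_csv(io.StringIO(csv_text))

# Count flagged packets per source address and destination port, most frequent first.
summary = (
    anomalies.groupby(["src_ip", "dst_port"])
    .size()
    .sort_values(ascending=False)
)
print(summary)
```

To run this against real output, replace the io.StringIO argument with the path to anomalies.csv.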
Code Implementation
ids.py
The main Python script for the IDS.
import argparse
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder, StandardScaler

warnings.filterwarnings("ignore")


def preprocess_data(df):
    """Preprocess network traffic data."""
    # Handle missing values; copy so the column assignments below
    # modify an independent frame rather than a view of the input
    df = df.dropna().copy()
    # Encode categorical features
    le_protocol = LabelEncoder()
    le_src_ip = LabelEncoder()
    le_dst_ip = LabelEncoder()
    df['protocol'] = le_protocol.fit_transform(df['protocol'])
    df['src_ip'] = le_src_ip.fit_transform(df['src_ip'])
    df['dst_ip'] = le_dst_ip.fit_transform(df['dst_ip'])
    # Select features for anomaly detection
    features = ['protocol', 'packet_size', 'src_port', 'dst_port', 'src_ip', 'dst_ip']
    X = df[features]
    # Normalize numerical features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled, df


def detect_anomalies(X_scaled, contamination=0.01):
    """Detect anomalies using Isolation Forest."""
    model = IsolationForest(contamination=contamination, random_state=42)
    # fit_predict returns 1 for inliers and -1 for outliers
    predictions = model.fit_predict(X_scaled)
    return predictions


def main():
    # Parse command-line arguments
    parser = argparse.ArgumentParser(description="Intrusion Detection System")
    parser.add_argument('--input', type=str, required=True, help="Input CSV file with network traffic data")
    parser.add_argument('--output', type=str, default='anomalies.csv', help="Output CSV file for anomalies")
    args = parser.parse_args()
    # Load data
    try:
        df = pd.read_csv(args.input)
    except FileNotFoundError:
        print(f"Error: Input file {args.input} not found.")
        return
    # Preprocess data
    X_scaled, df = preprocess_data(df)
    # Detect anomalies
    predictions = detect_anomalies(X_scaled)
    # Add predictions to dataframe
    df['anomaly'] = predictions
    df['anomaly'] = df['anomaly'].map({1: 'Normal', -1: 'Anomaly'})
    # Save anomalies to output file
    anomalies = df[df['anomaly'] == 'Anomaly']
    anomalies.to_csv(args.output, index=False)
    # Print summary
    print(f"Total packets analyzed: {len(df)}")
    print(f"Anomalies detected: {len(anomalies)}")
    print(f"Anomalies saved to: {args.output}")


if __name__ == "__main__":
    main()
Code Explanation
- Preprocessing: The preprocess_data function handles missing values, encodes categorical features (e.g., protocol, IPs) using LabelEncoder, and normalizes numerical features using StandardScaler.
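- Anomaly Detection: The detect_anomalies function fits an Isolation Forest and labels the same data in a single call (fit_predict). The contamination parameter (default: 0.01) sets the share of samples treated as outliers, so about 1% of the input is flagged by default.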
- Main Function: Parses command-line arguments, loads the input CSV, processes the data, detects anomalies, and saves results to an output CSV.
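To make the role of the contamination parameter concrete, the sketch below runs Isolation Forest on synthetic data with two different settings and counts how many rows are flagged. The data is random noise with no real intrusions, which is the point: the number of flagged rows follows the parameter, not the content.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))  # synthetic, already-scaled features

results = {}
for contamination in (0.01, 0.05):
    model = IsolationForest(contamination=contamination, random_state=42)
    labels = model.fit_predict(X)
    results[contamination] = int((labels == -1).sum())
    print(f"contamination={contamination}: {results[contamination]} of {len(X)} flagged")

# decision_function gives a continuous score; lower means more anomalous.
# Sorting by it ranks the flagged rows for review (uses the last model fitted).
scores = model.decision_function(X)
most_anomalous = np.argsort(scores)[:5]
print(most_anomalous)
```

Ranking by score rather than relying only on the Normal/Anomaly label lets an analyst start with the most unusual packets.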
Running the System
- Save the sample dataset as sample_data.csv.
- Run the script:
python ids.py --input sample_data.csv --output anomalies.csv
- Check the anomalies.csv file for detected intrusions.
Example Output (anomalies.csv)
timestamp,src_ip,dst_ip,protocol,packet_size,src_port,dst_port,anomaly
2025-08-17T10:00:05,192.168.1.10,10.0.0.1,TCP,1500,12345,23,Anomaly
...
Testing
- Unit Testing: Test the preprocessing and anomaly detection functions with synthetic data.
- Validation: Use a labeled dataset (e.g., NSL-KDD) to measure the model's precision and recall as well as its accuracy.
- Real-World Testing: Capture live traffic with Wireshark and test the system in a controlled environment.
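The validation step could look roughly like the sketch below. It uses synthetic labeled data as a stand-in for a benchmark such as NSL-KDD, so the printed numbers say nothing about performance on real traffic; the cluster positions and sizes are assumptions chosen to make the attacks easy to separate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(2)

# Synthetic labeled data: 980 "normal" rows near the origin
# and 20 "attack" rows far away from it.
normal = rng.normal(0, 1, size=(980, 3))
attack = rng.uniform(6, 10, size=(20, 3))
X = np.vstack([normal, attack])
y_true = np.array([0] * 980 + [1] * 20)  # 1 = attack

model = IsolationForest(contamination=0.02, random_state=42)
# fit_predict returns -1 for outliers; map that to 1 (attack) to match y_true.
y_pred = (model.fit_predict(X) == -1).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```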
Limitations
- False Positives: The Isolation Forest may flag legitimate but rare traffic as anomalies.
- Feature Selection: The system relies on predefined features; additional features (e.g., packet frequency) may improve accuracy.
- Real-Time Processing: The current implementation is batch-based; real-time processing would require integration with a packet capture library like pcapy.
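As an illustration of the packet-frequency feature mentioned above, the sketch below counts packets per source address in fixed 10-second windows with pandas. The window length and the column names window and packets_in_window are assumptions for the example, not part of ids.py.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-08-17T10:00:00", "2025-08-17T10:00:01",
        "2025-08-17T10:00:02", "2025-08-17T10:00:30",
    ]),
    "src_ip": ["192.168.1.10", "192.168.1.10", "192.168.1.11", "192.168.1.10"],
})

# Assign each packet to a 10-second window (window size is an arbitrary choice).
df["window"] = df["timestamp"].dt.floor("10s")

# Number of packets the same source sent within the same window.
df["packets_in_window"] = (
    df.groupby(["src_ip", "window"])["timestamp"].transform("count")
)
print(df)
```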
Future Improvements
- Integrate real-time packet capture using scapy or pyshark.
- Add more sophisticated features (e.g., packet inter-arrival time, entropy).
- Implement a hybrid model combining supervised and unsupervised learning.
- Develop a GUI for visualizing anomalies in real-time.
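Two of the features suggested above, packet inter-arrival time and entropy, can be computed from the existing CSV columns. The following is a minimal sketch under the assumption that entropy is taken over destination ports per source address; the helper shannon_entropy and the column inter_arrival_s are hypothetical names.

```python
import math
from collections import Counter

import pandas as pd


def shannon_entropy(values) -> float:
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-08-17T10:00:00", "2025-08-17T10:00:01", "2025-08-17T10:00:04",
    ]),
    "src_ip": ["192.168.1.10"] * 3,
    "dst_port": [80, 80, 443],
}).sort_values("timestamp")

# Seconds since the previous packet from the same source (NaN for the first one).
df["inter_arrival_s"] = df.groupby("src_ip")["timestamp"].diff().dt.total_seconds()

# Entropy of destination ports per source; contacting many distinct ports raises it.
port_entropy = df.groupby("src_ip")["dst_port"].apply(shannon_entropy)
print(df)
print(port_entropy)
```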
References
- Scikit-learn Documentation: https://scikit-learn.org/
- Wireshark User Guide: https://www.wireshark.org/docs/
- Isolation Forest: Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." IEEE International Conference on Data Mining.
This project provides a complete, working IDS with documentation and code. You can extend it by adding real-time capture capabilities or integrating more advanced machine learning models.