TruthGuard: AI-Powered Fake News Detector
Complete Documentation
From the extracted README and project description:
Fake News Detection in Python
In this project, we use various natural language processing techniques and machine learning algorithms to classify fake news articles, using the scikit-learn library in Python.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
Prerequisites
What you need to install and how to install it:
- Python 3.6
- Alternatively, an easier option is to download Anaconda and use the Anaconda Prompt to run the commands. To install Anaconda, see https://www.anaconda.com/download/
- You will also need to download and install the three packages below after installing either Python or Anaconda:
- Sklearn (scikit-learn)
- numpy
- scipy
- If you chose to install Python 3.6, run the commands below in a command prompt/terminal to install these packages:

pip install -U scikit-learn
pip install numpy
pip install scipy

- If you chose to install Anaconda, run the commands below in the Anaconda Prompt instead:

conda install -c anaconda scikit-learn
conda install -c anaconda numpy
conda install -c anaconda scipy

Dataset used
The data source used for this project is the LIAR dataset, which contains three files in .tsv format for the test, train, and validation splits. Below is a description of the data files used for this classification.
LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION
William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.
The original dataset contains 14 variables/columns for the train, test, and validation sets, as follows:
- Column 1: the ID of the statement ([ID].json).
- Column 2: the label. (Label class contains: True, Mostly-true, Half-true, Barely-true, False, Pants-fire)
- Column 3: the statement.
- Column 4: the subject(s).
- Column 5: the speaker.
- Column 6: the speaker's job title.
- Column 7: the state info.
- Column 8: the party affiliation.
- Columns 9-13: the total credit history count, including the current statement.
- 9: barely true counts.
- 10: false counts.
- 11: half true counts.
- 12: mostly true counts.
- 13: pants on fire counts.
- Column 14: the context (venue / location of the speech or statement).
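As a rough sketch of reading these raw files with pandas, assuming the .tsv files sit in the "liar" folder and follow the 14-column layout above (the names in liar_columns are illustrative labels of our own, not part of the dataset):

import pandas as pd

# Assign names matching the 14-column layout described above (names are illustrative)
liar_columns = [
    'id', 'label', 'statement', 'subject', 'speaker', 'job_title',
    'state', 'party', 'barely_true_counts', 'false_counts',
    'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts',
    'context'
]

# LIAR .tsv files are tab-separated with no header row
train_raw = pd.read_csv('liar/train.tsv', sep='\t', names=liar_columns)
print(train_raw[['label', 'statement']].head())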
To keep things simple, we have chosen only two variables from the original dataset for this classification. The other variables can be added later to increase complexity and enhance the features.
Below are the columns used to create the three datasets used in this project:
- Column 1: Statement (News headline or text).
- Column 2: Label (Label class contains: True, False)
You will see that the newly created dataset has only two classes, compared to six in the original. Below is the method used for reducing the number of classes (a code sketch follows the list):
- Original -- New
- True -- True
- Mostly-true -- True
- Half-true -- True
- Barely-true -- False
- False -- False
- Pants-fire -- False
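A minimal sketch of this six-to-two reduction, assuming the train_raw frame from the snippet above (the label strings match the lowercase spellings used in the raw LIAR files; label_map and binary_label are illustrative names):

# Encode the mapping table above as a dictionary
label_map = {
    'true': 'True',
    'mostly-true': 'True',
    'half-true': 'True',
    'barely-true': 'False',
    'false': 'False',
    'pants-fire': 'False',
}

train_raw['binary_label'] = train_raw['label'].map(label_map)

# Keep only the two columns used in this project and write the CSV
out = train_raw[['statement', 'binary_label']].rename(
    columns={'statement': 'Statement', 'binary_label': 'Label'})
out.to_csv('train.csv', index=False)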
The datasets used for this project are in CSV format, named train.csv, test.csv, and valid.csv, and can be found in the repo. The original datasets are in the "liar" folder in TSV format.
File descriptions
DataPrep.py
This file contains all the preprocessing functions needed to process the input documents and texts. First we read the train, test, and validation data files, then perform preprocessing such as tokenizing and stemming. Some exploratory data analysis is also performed, such as checking the response-variable distribution and data-quality checks for null or missing values. A sketch of these checks follows.
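For illustration, the data-quality and distribution checks might look like this with pandas, assuming the two-column train.csv produced above:

import pandas as pd

df = pd.read_csv('train.csv')          # columns: 'Statement', 'Label'
print(df.isnull().sum())               # data-quality check: null/missing values per column
print(df['Label'].value_counts())      # response-variable distribution across the two classes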
FeatureSelection.py
In this file we perform feature extraction and selection using the scikit-learn Python library. For feature extraction we use methods like simple bag-of-words and n-grams, followed by term-frequency weighting such as TF-IDF. We also experimented with word2vec and POS tagging for feature extraction, though neither has been used at this point in the project. A sketch of the bag-of-words, n-gram, and TF-IDF extractors follows.
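A minimal sketch of these extractors with scikit-learn, assuming the df frame from the snippet above (the ngram_range and max_features values are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = df['Statement']

# Simple bag-of-words: raw token counts
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# n-grams: unigram and bigram counts
ngrams = CountVectorizer(ngram_range=(1, 2))
X_ngrams = ngrams.fit_transform(docs)

# TF-IDF weighting over the same n-grams
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_tfidf = tfidf.fit_transform(docs)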
classifier.py
Here we build all the classifiers for fake news detection. The extracted features are fed into different classifiers. We use Naive Bayes, Logistic Regression, Linear SVM, Stochastic Gradient Descent, KNN, Decision Tree, Random Forest, and SVM algorithms to build different models and evaluate their performance.
Results
The results are evaluated using the F1 score and are tabulated in the README of the GitHub repo; an illustrative snippet for reproducing such a score follows.
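This is a sketch, not the repo's exact evaluation code: a 5-fold cross-validated F1 score for one model, using f1_macro so it also works with string labels:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# 5-fold cross-validated macro F1 for a TF-IDF + Naive Bayes pipeline
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
scores = cross_val_score(pipeline, df['Statement'], df['Label'], cv=5, scoring='f1_macro')
print(f"Mean F1: {scores.mean():.3f}")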
Conclusion
This project demonstrates a practical approach to fake news detection using machine learning, combining NLP techniques with classification algorithms. It can be extended with more features or deep learning models for improved accuracy, and toward applications such as real-time news verification.
Complete Code
Below is the complete Python code for the project, combining data preparation, feature extraction, and classification. It assumes a train.csv file with columns 'text' and 'label' (0 for false, 1 for true); such a file can be built from the LIAR dataset as shown earlier, or a comparable dataset can be downloaded from Kaggle.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')
# Load data
df = pd.read_csv('train.csv') # Assume columns: 'text', 'label'
# Preprocessing function
def preprocess_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha()]  # keep alphabetic tokens only (removes punctuation)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(tokens)
# Apply preprocessing
df['processed_text'] = df['text'].apply(preprocess_text)
# Split data
X = df['processed_text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature extraction with TF-IDF
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
# Define models
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),  # raise max_iter to ensure convergence on sparse TF-IDF features
    'Linear SVM': LinearSVC(),
    'SGD Classifier': SGDClassifier(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC()
}
# Train and evaluate
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_test_tfidf)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='binary')
    print(f"{name} - Accuracy: {acc}, F1 Score: {f1}")
    print(classification_report(y_test, y_pred))
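As an optional extension beyond the original project, the fitted vectorizer and a trained model can be persisted with joblib so new headlines can be scored without retraining (the file names here are illustrative):

import joblib

# Save the fitted TF-IDF vectorizer and one trained model from the loop above
joblib.dump(tfidf, 'tfidf.joblib')
joblib.dump(models['Logistic Regression'], 'logreg.joblib')

# Later: reload and classify a new headline (0 = false, 1 = true per the label scheme above)
vec = joblib.load('tfidf.joblib')
clf = joblib.load('logreg.joblib')
sample = preprocess_text("Scientists discover water on Mars")
print(clf.predict(vec.transform([sample])))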