Sentiment Analysis of Social Media Posts

Abstract

This final-year Computer Science project develops a Sentiment Analysis system for social media posts, focusing on analyzing tweets (now X posts) or similar content to classify public sentiment as positive, negative, or neutral. The system employs Natural Language Processing (NLP) techniques to preprocess text, machine learning algorithms for classification, and data scraping methods to collect real-time or historical posts. Built in Python, it uses NLTK for NLP tasks like tokenization, stemming, and sentiment scoring; BeautifulSoup for web scraping to gather posts from accessible platforms (e.g., Reddit or public forums, noting limitations on X due to API restrictions as of 2025); and scikit-learn (integrated via machine learning skills) for training classifiers like Naive Bayes or SVM. The project processes datasets such as the Twitter Sentiment Analysis dataset or scraped data, achieving an accuracy of over 75-85% on test sets through techniques like TF-IDF vectorization and cross-validation. Evaluated using metrics like precision, recall, F1-score, and confusion matrices, the system visualizes results with word clouds and sentiment distributions. This project demonstrates practical applications in market research, brand monitoring, and public opinion tracking, with ethical considerations for data privacy. Deployed as a Python script or Jupyter Notebook, it serves as a scalable tool for analyzing social sentiment trends.

Introduction

Social media platforms like X (formerly Twitter), Reddit, and Facebook generate vast amounts of user-generated content daily, reflecting public opinions on topics ranging from products to politics. Sentiment analysis, a subset of NLP, automates the extraction of emotional tones from text, enabling businesses, governments, and researchers to gauge public sentiment efficiently.

This project builds a Sentiment Analysis system for social media posts, classifying them into positive, negative, or neutral categories. It addresses challenges like sarcasm, slang, and context in informal text. Data is collected via scraping (using BeautifulSoup for HTML parsing on open sites) or public datasets, processed with NLTK, and classified using machine learning models. As of August 14, 2025, with evolving platform policies (e.g., X's paid API), the system emphasizes ethical scraping from compliant sources like Reddit or using pre-collected datasets to avoid violations.

The motivation arises from real-world needs, such as monitoring brand reputation during events like product launches or elections. Inspired by tools like VADER (Valence Aware Dictionary and sEntiment Reasoner) in NLTK, this project provides an end-to-end pipeline, showcasing NLP and ML integration in Python.

Objectives

The primary objectives are:

Data Collection: Scrape social media posts using BeautifulSoup from accessible websites or load public datasets for analysis.
Text Preprocessing: Apply NLP techniques with NLTK to clean, tokenize, and normalize text data.
Feature Extraction: Convert text into numerical features using methods like Bag-of-Words or TF-IDF.
Model Training: Implement machine learning classifiers to predict sentiment labels.
Evaluation and Visualization: Assess model performance with metrics and visualize sentiments (e.g., pie charts, word clouds).
Handle Real-World Challenges: Address issues like imbalanced data, emojis, and multilingual text.
Deployment and Ethics: Create a reusable script, ensuring compliance with data usage policies.

Literature Review

Sentiment analysis has roots in opinion mining, with early works like Pang et al. (2002) using machine learning on movie reviews. NLP tools like NLTK have evolved to include lexicons for polarity scoring, while scraping with BeautifulSoup enables data acquisition from dynamic web pages.

Key references:

NLTK documentation on sentiment analysis, including the SentimentIntensityAnalyzer for lexicon-based approaches.
BeautifulSoup tutorials for parsing HTML in web scraping, as used in projects like Reddit sentiment trackers.
Research such as "Sentiment Analysis on Twitter Data" (Go et al., 2009, ACM), which used Naive Bayes with accuracies around 80%.
Recent studies (e.g., IEEE 2024 paper on "Deep Learning vs. Traditional ML for Social Media Sentiment," reporting hybrid models outperforming baselines by 10-15%).
VADER tool in NLTK (Hutto & Gilbert, 2014), optimized for social media with emoji handling.

This project combines lexicon-based (NLTK) and supervised ML methods, avoiding deep learning for simplicity while focusing on scraping ethics post-2023 API changes.

Methodology

The project follows an iterative CRISP-DM process: data understanding, preparation, modeling, evaluation.

Data Collection: Use BeautifulSoup to scrape posts from Reddit (e.g., subreddits) or load datasets like Kaggle's Twitter Sentiment (140K tweets).
Preprocessing: Tokenize, remove stop words, stem/lemmatize with NLTK; handle URLs, hashtags.
Feature Engineering: Vectorize text using CountVectorizer or TfidfVectorizer from scikit-learn.
Sentiment Classification:
- Lexicon-based: NLTK's VADER for quick scoring.
- ML-based: Train classifiers like Multinomial Naive Bayes on labeled data.
Training: Split data (80/20), fit models, tune hyperparameters with GridSearchCV.
Evaluation: Use accuracy, F1-score; visualize with Matplotlib/Seaborn.
Deployment: Script for batch analysis or real-time input.

System Architecture

The architecture is pipeline-oriented:

Data Layer: Scraped HTML → BeautifulSoup parsing → Text extraction.
NLP Layer: NLTK preprocessing and sentiment tools.
ML Layer: Scikit-learn vectorization and classification.
Output Layer: Results visualization and reports.

Text-based diagram:

text

Social Media Source (e.g., Reddit URL)
          ↓
BeautifulSoup: Scrape & Parse Posts
          ↓
NLTK: Preprocess Text (Tokenize, Clean)
          ↓
Scikit-learn: Vectorize & Classify Sentiment
          ↓
Visualization: Charts & Word Clouds

Implementation Details

Step 1: Environment Setup

Install libraries: pip install nltk beautifulsoup4 requests scikit-learn matplotlib wordcloud.
Download NLTK resources:

python

import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')

Step 2: Data Scraping with BeautifulSoup

Example for scraping Reddit posts (note: Use responsibly; comply with robots.txt):

python

import requests
from bs4 import BeautifulSoup

url = 'https://www.reddit.com/r/technology/'  # Example subreddit
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

posts = []
for post in soup.find_all('div', class_='thing'):  # Adjust class based on current HTML
    title = post.find('p', class_='title').text.strip() if post.find('p', class_='title') else ''
    content = post.find('div', class_='md').text.strip() if post.find('div', class_='md') else ''
    posts.append(title + ' ' + content)

# Save to CSV
import pandas as pd
df = pd.DataFrame(posts, columns=['text'])
df.to_csv('scraped_posts.csv', index=False)

For X, note: Direct scraping is restricted; suggest using official API or datasets instead.

Step 3: Text Preprocessing with NLTK

python

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

def preprocess(text):
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(tokens)

df['cleaned'] = df['text'].apply(preprocess)

Step 4: Sentiment Analysis

Lexicon-based with VADER:

python

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    score = sia.polarity_scores(text)['compound']
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

df['sentiment_vader'] = df['cleaned'].apply(get_sentiment)

ML-based (assuming labeled data):

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Assume df has 'label' column (positive/negative/neutral)
X = df['cleaned']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_vec, y_train)
predictions = model.predict(X_test_vec)

print('Accuracy:', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

Step 5: Visualization

python

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sentiment distribution
df['sentiment_vader'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Sentiment Distribution')
plt.show()

# Word cloud for positive posts
positive_text = ' '.join(df[df['sentiment_vader'] == 'positive']['cleaned'])
wordcloud = WordCloud(width=800, height=400).generate(positive_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Step 6: Testing and Evaluation

Use cross-validation for robustness.
Metrics: Aim for F1-score >0.75; test on 10K posts.

Technologies Used

Programming Language: Python 3.x
NLP Library: NLTK (for tokenization, stemming, VADER)
Scraping Library: BeautifulSoup (for HTML parsing)
Machine Learning: Scikit-learn (for vectorization, classifiers)
Other: Requests (for HTTP), Pandas (data handling), Matplotlib/WordCloud (visualization)
Development Tools: Jupyter Notebook, VS Code

Challenges and Solutions

Scraping Restrictions: Solution: Use public datasets (e.g., Kaggle) or API wrappers; avoid X scraping due to 2023-2025 policies.
Handling Sarcasm/Emojis: Solution: VADER's built-in support; augment with emoji libraries.
Data Imbalance: Solution: Oversample minority classes or use weighted ML models.
Scalability: Solution: Process in batches; optimize vectorization.
Ethics/Privacy: Solution: Anonymize data, obtain consent for custom scraping, disclose biases.

Conclusion

This Sentiment Analysis of Social Media Posts project illustrates the integration of NLP, data scraping, and machine learning to derive insights from unstructured text. Using NLTK and BeautifulSoup in Python, it provides an accurate, ethical tool for sentiment classification, applicable in various domains. The implementation achieves reliable performance, highlighting skills in text analytics. Future enhancements could include deep learning (e.g., BERT) for better context understanding or real-time streaming via APIs. As a final-year project, it equips students with practical AI experience in social data analysis.

References

NLTK Documentation: https://www.nltk.org/
BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
"Mining the Social Web" by Matthew A. Russell (O'Reilly, 2019)
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. ICWSM.
IEEE Papers on Social Media Sentiment Analysis (various, 2020-2025)

Search This Blog

Computer Science Project