Detecting hate speech against women and immigrants using NLP and deep learning — with an honest look at why models fail.
This project builds and compares four progressively sophisticated models for binary hate speech classification on Twitter, using the SemEval-2019 Task 5 (HatEval) benchmark — one of the most widely cited hate speech datasets in NLP research.
Beyond model building, this project investigates a critical real-world challenge: why hate speech detectors systematically fail when profanity is used casually rather than hatefully. The analysis reveals that models learn lexical shortcuts (profanity → hate) that break down under distribution shift — a finding consistent with the original competition results, where even the winning team achieved only 0.651 macro F1 on the adversarial test set.
The HatEval dataset consists of 12,971 English tweets collected from Twitter, annotated for hate speech directed at women or immigrants:
| Split | Total | Not Hate | Hate | Hate % |
|---|---|---|---|---|
| Train | 9,000 | 5,217 | 3,783 | 42.0% |
| Dev | 1,000 | 573 | 427 | 42.7% |
| Test | 2,971 | 1,719 | 1,252 | 42.1% |
Each tweet includes three labels: HS (hate speech), TR (individual vs group target), and AG (aggressiveness). The primary task uses the HS label for binary classification.
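Assuming the standard HatEval TSV layout (tab-separated columns `id`, `text`, `HS`, `TR`, `AG`), a split can be loaded with pandas; the inline sample below is a hypothetical stand-in for `data/train_en.tsv`:

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the HatEval TSV layout.
sample_tsv = (
    "id\ttext\tHS\tTR\tAG\n"
    "1\tsome hateful tweet\t1\t0\t1\n"
    "2\tan ordinary tweet\t0\t0\t0\n"
)

# In the project this would be: pd.read_csv("data/train_en.tsv", sep="\t")
train = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# HS is the binary hate-speech label used for the primary task.
X, y = train["text"], train["HS"]
```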
Citation: Basile, V., Bosco, C., Fersini, E., et al. (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63.
The project follows a deliberate four-tier progression — from simple bag-of-words to transfer learning — demonstrating the strengths and limitations of each approach:
Standard bag-of-words approach using unigram and bigram TF-IDF features (10K vocabulary) with logistic regression. Establishes a solid starting point.
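A minimal scikit-learn sketch of this baseline (the 10K vocabulary cap matches the description above; the toy training data and remaining hyperparameters are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Unigram + bigram TF-IDF capped at a 10K vocabulary, fed to logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=10_000),
    LogisticRegression(max_iter=1000),
)

# Toy stand-in data; the project fits on the 9,000 HatEval training tweets.
texts = ["I hate all of them", "lovely weather today", "go back home", "great game tonight"]
labels = [1, 0, 1, 0]
baseline.fit(texts, labels)
```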
Combines 30K word-level trigram features with 30K character-level n-grams to capture sub-word patterns and spelling variations. Uses class-weighted logistic regression to address label imbalance.
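The word+char combination can be sketched with a `FeatureUnion`; the 30K caps mirror the description, while the character n-gram range of (2, 5) is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline

# Word-level n-grams up to trigrams plus character n-grams, 30K features each.
features = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 3), max_features=30_000)),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), max_features=30_000)),
])

# class_weight="balanced" reweights the ~42/58 hate / not-hate split.
model = make_pipeline(features, LogisticRegression(class_weight="balanced", max_iter=1000))

# Toy stand-in data; note the char n-grams can catch spellings like "b!tch".
texts = ["u r a b!tch", "what a beautiful day", "send them back", "nice goal"]
labels = [1, 0, 1, 0]
model.fit(texts, labels)
```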
A bidirectional LSTM with pre-trained GloVe Twitter embeddings (100d, trained on 27B tweets) and a self-attention mechanism. Despite sequence modelling capability, the model overfits on 9K training samples — demonstrating why pre-trained representations are essential for small-dataset NLP.
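The architecture can be sketched in PyTorch as follows; the hidden size and the exact attention formulation are assumptions, and the random embeddings stand in for the GloVe initialisation:

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Sketch of Model 3: BiLSTM over (GloVe-initialised) embeddings
    with additive self-attention pooling. Sizes are illustrative."""

    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        # In the project the weights would be loaded from
        # glove.twitter.27B.100d.txt; here they are randomly initialised.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embedding(token_ids))   # (B, T, 2H)
        weights = torch.softmax(self.attn(out), dim=1)  # attention over T
        pooled = (weights * out).sum(dim=1)             # (B, 2H)
        return self.classifier(pooled)                  # (B, 2) logits

model = BiLSTMAttention(vocab_size=5000)
logits = model(torch.randint(1, 5000, (4, 20)))  # batch of 4 sequences, length 20
```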
CardiffNLP's twitter-roberta-base, pre-trained on ~58M tweets, fine-tuned for 3 epochs. Achieves the best performance across all metrics.
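The fine-tuning recipe can be sketched with the HuggingFace `Trainer` API; only the model name and the 3-epoch budget come from the description above, the remaining hyperparameters are illustrative, and the heavy imports are deferred into the function so the sketch stays importable:

```python
def fine_tune(train_dataset, eval_dataset, output_dir="roberta-hateval"):
    """Sketch of Model 4: fine-tune cardiffnlp/twitter-roberta-base for 3 epochs.
    Datasets are assumed already tokenised with the matching tokenizer."""
    # Deferred imports keep this sketch importable without transformers installed.
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    name = "cardiffnlp/twitter-roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # stated in the project description
        per_device_train_batch_size=16,  # illustrative
        learning_rate=2e-5,              # illustrative
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset,
                      tokenizer=tokenizer)
    trainer.train()
    return trainer
```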
| # | Model | Dev F1 | Test F1 |
|---|---|---|---|
| 1 | TF-IDF + Logistic Regression | 0.725 | 0.488 |
| 2 | Word+Char N-grams + Tuned LR | 0.735 | 0.398 |
| 3 | BiLSTM + GloVe + Attention | 0.744 | 0.473 |
| 4 | Twitter-RoBERTa (fine-tuned) | 0.812 | 0.535 |
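The F1 values here follow the competition's macro-averaged convention (assumed to match the project's evaluation): per-class F1 is computed for "hate" and "not hate" and averaged, so the minority class counts equally. A minimal sketch with toy labels:

```python
from sklearn.metrics import f1_score

# Toy labels: macro F1 averages the per-class F1 of "hate" (1) and "not hate" (0).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
macro_f1 = f1_score(y_true, y_pred, average="macro")
```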
The dramatic dev-to-test performance drop is not a bug — it's a known property of this benchmark. In the original SemEval-2019 competition (74 teams, 108 submissions):
| Team | Test Macro F1 | Method |
|---|---|---|
| Fermi (1st place) | 0.651 | SVM + Universal Sentence Encoder |
| Panaetius (2nd) | 0.571 | Undisclosed |
| 3rd–5th place | 0.519–0.535 | CNNs, LSTMs |
| This project | 0.535 | Twitter-RoBERTa (fine-tuned) |
Our fine-tuned transformer achieves performance comparable to the top 5 competition entries, despite using a straightforward fine-tuning approach without competition-specific optimisations.
The most important insight from this project is why all models fail on the test set. Investigation reveals a deliberate distribution shift in profanity-hate associations:
| Word | Train: % in Hate | Test: % in Hate | Shift |
|---|---|---|---|
| bitch | 78.1% | 43.5% | −34.6pp |
| cunt | 53.6% | 39.1% | −14.5pp |
| whore | 66.2% | 59.2% | −7.0pp |
What's happening: In training, profanity appears predominantly in hateful tweets. Models learn the shortcut "profanity = hate speech" rather than understanding whether hate is directed at a protected group. The test set deliberately includes casual profanity ("bitch please," "that show is my guilty pleasure") to expose this bias.
The consequence: All models predict 80–92% of test tweets as hateful when only 42% actually are — producing massive false positive rates.
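The per-word shift statistics above can be reproduced with a short pandas sketch; the toy frames below are illustrative stand-ins for the real train/test splits:

```python
import pandas as pd

def hate_rate(df: pd.DataFrame, word: str) -> float:
    """Of the tweets containing `word`, what fraction is labelled hate (HS=1)?"""
    mask = df["text"].str.contains(word, case=False, regex=False)
    return df.loc[mask, "HS"].mean()

# Toy stand-in frames; the project computes this over the real splits.
train = pd.DataFrame({"text": ["u bitch", "bitch please", "hello"], "HS": [1, 1, 0]})
test = pd.DataFrame({"text": ["bitch please", "that bitch show", "bye"], "HS": [1, 0, 0]})

for w in ["bitch"]:
    shift = hate_rate(test, w) - hate_rate(train, w)
    print(f"{w}: train {hate_rate(train, w):.1%}, test {hate_rate(test, w):.1%}, shift {shift:+.1%}")
```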
Why this matters for real-world content moderation: Automated systems that rely on keyword-level patterns will systematically over-flag casual language while potentially missing subtle or implicit hate speech that doesn't use explicit slurs. This is a well-documented challenge in the hate speech detection literature and a key consideration for deploying these models in production.
- Python 3.10+
- NVIDIA GPU with CUDA support (recommended for transformer fine-tuning)
```
git clone https://github.com/pedramebd/hate-speech-detection.git
cd hate-speech-detection
pip install -r requirements.txt
```

```
python demo.py
```

The demo loads the fine-tuned transformer and classifies sample tweets interactively.

```
jupyter notebook hate_speech_detection.ipynb
```

The BiLSTM model requires GloVe Twitter embeddings. Download them separately:

```
wget https://nlp.stanford.edu/data/glove.twitter.27B.zip
unzip glove.twitter.27B.zip
```

Note: the GloVe zip file is ~1.4 GB. Only `glove.twitter.27B.100d.txt` is used.
```
hate-speech-detection/
├── hate_speech_detection.ipynb   # Full pipeline notebook (EDA → modelling → analysis)
├── demo.py                       # Interactive demo script
├── requirements.txt              # Python dependencies
├── README.md
├── LICENSE
├── .gitignore
├── data/
│   ├── train_en.tsv              # Training data (9,000 tweets)
│   ├── dev_en.tsv                # Development data (1,000 tweets)
│   └── test_en.tsv               # Test data (2,971 tweets)
└── figures/
    ├── label_distribution.png
    ├── tweet_length_distribution.png
    ├── top_words_by_class.png
    ├── baseline_confusion_matrix.png
    ├── bilstm_training_curves.png
    ├── model_comparison.png
    ├── final_model_comparison.png
    └── confusion_matrix_comparison.png
```
| Category | Tools |
|---|---|
| Languages | Python 3.10 |
| Classical ML | Scikit-learn (TF-IDF, Logistic Regression, LinearSVC) |
| Deep Learning | PyTorch 2.x |
| NLP | HuggingFace Transformers, GloVe Embeddings |
| Pre-trained Model | cardiffnlp/twitter-roberta-base |
| Visualisation | Matplotlib, Seaborn |
| Hardware | NVIDIA RTX 4060 (CUDA) |
- Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F.M., Rosso, P. & Sanguinetti, M. (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63.
- Indurthi, V., Syed, B., Shrivastava, M., Chakravartula, N., Gupta, M. & Varma, V. (2019). FERMI at SemEval-2019 Task 5: Using Sentence Embeddings to Identify Hate Speech Against Immigrants and Women in Twitter. SemEval-2019.
- Barbieri, F., Camacho-Collados, J., Neves, L. & Espinosa-Anke, L. (2020). TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. Findings of EMNLP 2020.
- Pennington, J., Socher, R. & Manning, C.D. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.
Pedram Ebadollahyvahed — MSc Data Science, Cardiff University (2025–2026)
This project is licensed under the MIT License — see the LICENSE file for details.



