
Hate Speech Detection on Twitter

Detecting hate speech against women and immigrants using NLP and deep learning — with an honest look at why models fail.



Overview

This project builds and compares four progressively sophisticated models for binary hate speech classification on Twitter, using the SemEval-2019 Task 5 (HatEval) benchmark — one of the most widely cited hate speech datasets in NLP research.

Beyond model building, this project investigates a critical real-world challenge: why hate speech detectors systematically fail when profanity is used casually rather than hatefully. The analysis reveals that models learn lexical shortcuts (profanity → hate) that break down under distribution shift — a finding consistent with the original competition results, where even the winning team achieved only 0.651 macro F1 on the adversarial test set.

Model Comparison


Dataset

The HatEval dataset consists of 12,971 English tweets collected from Twitter, annotated for hate speech directed at women or immigrants:

| Split | Total | Not Hate | Hate | Hate % |
|-------|-------|----------|------|--------|
| Train | 9,000 | 5,217 | 3,783 | 42.0% |
| Dev   | 1,000 | 573   | 427   | 42.7% |
| Test  | 2,971 | 1,719 | 1,252 | 42.1% |

Each tweet includes three labels: HS (hate speech), TR (individual vs group target), and AG (aggressiveness). The primary task uses the HS label for binary classification.
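The splits ship as tab-separated files with one column per annotation. A minimal loader for the binary task might look like the sketch below (column names `id`, `text`, `HS`, `TR`, `AG` follow the HatEval release; adjust if your copy differs):

```python
import csv


def load_hateval_tsv(path):
    """Load a HatEval split into a list of (text, label) pairs.

    Each row carries three annotations (HS, TR, AG); for the binary
    task we keep only the HS column (1 = hate speech, 0 = not).
    """
    examples = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            examples.append((row["text"], int(row["HS"])))
    return examples
```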

Label Distribution

Citation: Basile, V., Bosco, C., Fersini, E., et al. (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63.


Model Progression

The project follows a deliberate four-tier progression — from simple bag-of-words to transfer learning — demonstrating the strengths and limitations of each approach:

1. TF-IDF + Logistic Regression (Baseline)

Standard bag-of-words approach using unigram and bigram TF-IDF features (10K vocabulary) with logistic regression. Establishes a solid starting point.
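A minimal sketch of this baseline with scikit-learn (the tweets below are invented placeholders, not dataset examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Unigram + bigram TF-IDF capped at a 10K vocabulary, plain logistic regression.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy usage with placeholder tweets:
train_texts = ["I hate all of them", "lovely weather today",
               "they should all leave", "great match last night"]
train_labels = [1, 0, 1, 0]
baseline.fit(train_texts, train_labels)
preds = baseline.predict(["what a lovely day", "they should leave now"])
```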

2. Word & Character N-grams + Tuned LR (Enhanced Classical ML)

Combines 30K word-level features (unigrams through trigrams) with 30K character-level n-grams to capture sub-word patterns and spelling variations. Uses class-weighted logistic regression to address the label imbalance.
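The feature combination can be sketched with a scikit-learn `FeatureUnion`. The character n-gram range (2–5) is an assumption for illustration; the notebook's exact settings may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Word n-grams up to trigrams + character n-grams, each capped at 30K features.
# "char_wb" keeps character n-grams inside word boundaries, which helps
# capture spelling variants of slurs ("b1tch", "bi*ch").
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3),
                             max_features=30_000)),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5),
                             max_features=30_000)),
])

model = Pipeline([
    ("features", features),
    # class_weight="balanced" upweights the minority (hate) class.
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
```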

3. BiLSTM + GloVe + Attention (Custom Deep Learning)

A bidirectional LSTM with pre-trained GloVe Twitter embeddings (100d, trained on 27B tweets) and a self-attention mechanism. Despite sequence modelling capability, the model overfits on 9K training samples — demonstrating why pre-trained representations are essential for small-dataset NLP.
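The architecture can be sketched in PyTorch as below. Hidden size and the additive attention formulation are illustrative assumptions; in the project the embedding layer would be initialised from the GloVe Twitter 100d vectors rather than randomly:

```python
import torch
import torch.nn as nn


class BiLSTMAttention(nn.Module):
    """BiLSTM encoder with self-attention pooling over time steps."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        # Initialised from GloVe Twitter 100d in the project; random here.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                 # (B, T, E)
        h, _ = self.lstm(x)                           # (B, T, 2H)
        weights = torch.softmax(self.attn(h), dim=1)  # (B, T, 1)
        pooled = (weights * h).sum(dim=1)             # (B, 2H)
        return self.fc(pooled)                        # (B, num_classes)
```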

BiLSTM Training Curves

4. Fine-tuned Twitter-RoBERTa (Pre-trained Transformer)

CardiffNLP's twitter-roberta-base, pre-trained on ~58M tweets, fine-tuned for 3 epochs. Achieves the best performance across all metrics.
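The fine-tuning step can be sketched with the HuggingFace `Trainer` API. Batch size and learning rate here are plausible defaults, not the project's confirmed settings, and the datasets are assumed to be already tokenised with the matching tokenizer:

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

MODEL_NAME = "cardiffnlp/twitter-roberta-base"


def fine_tune(train_dataset, eval_dataset, output_dir="roberta-hateval"):
    """Fine-tune Twitter-RoBERTa for binary hate speech classification."""
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2)
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # matches the 3-epoch setup above
        per_device_train_batch_size=16,  # illustrative; tune to GPU memory
        learning_rate=2e-5,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return trainer
```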


Results

| # | Model | Dev F1 | Test F1 |
|---|-------|--------|---------|
| 1 | TF-IDF + Logistic Regression | 0.725 | 0.488 |
| 2 | Word+Char N-grams + Tuned LR | 0.735 | 0.398 |
| 3 | BiLSTM + GloVe + Attention | 0.744 | 0.473 |
| 4 | Twitter-RoBERTa (fine-tuned) | 0.812 | 0.535 |

Competition Context

The dramatic dev-to-test performance drop is not a bug — it's a known property of this benchmark. In the original SemEval-2019 competition (74 teams, 108 submissions):

| Team | Test Macro F1 | Method |
|------|---------------|--------|
| Fermi (1st place) | 0.651 | SVM + Universal Sentence Encoder |
| Panaetius (2nd) | 0.571 | Undisclosed |
| 3rd–5th place | 0.519–0.535 | CNNs, LSTMs |
| This project | 0.535 | Twitter-RoBERTa (fine-tuned) |

Our fine-tuned transformer achieves performance comparable to the top 5 competition entries, despite using a straightforward fine-tuning approach without competition-specific optimisations.


Key Finding: The Lexical Bias Problem

The most important insight from this project is why all models fail on the test set. Investigation reveals a deliberate distribution shift in profanity-hate associations:

| Word | Train: % in Hate | Test: % in Hate | Shift |
|------|------------------|-----------------|-------|
| bitch | 78.1% | 43.5% | −34.6 pp |
| cunt  | 53.6% | 39.1% | −14.5 pp |
| whore | 66.2% | 59.2% | −7.0 pp |

What's happening: In training, profanity appears predominantly in hateful tweets. Models learn the shortcut "profanity = hate speech" rather than understanding whether hate is directed at a protected group. The test set deliberately includes casual profanity ("bitch please," "that show is my guilty pleasure") to expose this bias.

The consequence: All models predict 80–92% of test tweets as hateful when only 42% actually are — producing massive false positive rates.
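The per-word shift is straightforward to measure: for each profane word, compute the fraction of tweets containing it that are labelled hateful, separately for train and test. A minimal sketch with invented toy tweets (not dataset examples):

```python
def hate_rate(tweets, labels, word):
    """Of the tweets containing `word`, what fraction is labelled hateful?"""
    hits = [label for tweet, label in zip(tweets, labels)
            if word in tweet.lower().split()]
    return sum(hits) / len(hits) if hits else 0.0

# Toy illustration of the shift shown above (invented tweets):
train = (["you are a bitch", "that bitch ruined it", "nice day"], [1, 1, 0])
test = (["bitch please lol", "what a bitch move", "nice day"], [0, 1, 0])

train_rate = hate_rate(*train, "bitch")  # 1.0 in this toy set
test_rate = hate_rate(*test, "bitch")    # 0.5
shift = test_rate - train_rate           # negative = word less predictive at test time
```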

Confusion Matrix Comparison

Why this matters for real-world content moderation: Automated systems that rely on keyword-level patterns will systematically over-flag casual language while potentially missing subtle or implicit hate speech that doesn't use explicit slurs. This is a well-documented challenge in the hate speech detection literature and a key consideration for deploying these models in production.


Quick Start

Prerequisites

  • Python 3.10+
  • NVIDIA GPU with CUDA support (recommended for transformer fine-tuning)

Installation

```shell
git clone https://github.com/pedramebd/hate-speech-detection.git
cd hate-speech-detection
pip install -r requirements.txt
```

Run the Demo

```shell
python demo.py
```

The demo loads the fine-tuned transformer and classifies sample tweets interactively.

Run the Full Notebook

```shell
jupyter notebook hate_speech_detection.ipynb
```

GloVe Embeddings (for BiLSTM)

The BiLSTM model requires GloVe Twitter embeddings. Download them separately:

```shell
wget https://nlp.stanford.edu/data/glove.twitter.27B.zip
unzip glove.twitter.27B.zip
```

Note: The GloVe zip file is ~1.4 GB. Only glove.twitter.27B.100d.txt is used.


Project Structure

```
hate-speech-detection/
├── hate_speech_detection.ipynb   # Full pipeline notebook (EDA → modelling → analysis)
├── demo.py                       # Interactive demo script
├── requirements.txt              # Python dependencies
├── README.md
├── LICENSE
├── .gitignore
├── data/
│   ├── train_en.tsv              # Training data (9,000 tweets)
│   ├── dev_en.tsv                # Development data (1,000 tweets)
│   └── test_en.tsv               # Test data (2,971 tweets)
└── figures/
    ├── label_distribution.png
    ├── tweet_length_distribution.png
    ├── top_words_by_class.png
    ├── baseline_confusion_matrix.png
    ├── bilstm_training_curves.png
    ├── model_comparison.png
    ├── final_model_comparison.png
    └── confusion_matrix_comparison.png
```

Tech Stack

| Category | Tools |
|----------|-------|
| Languages | Python 3.10 |
| Classical ML | Scikit-learn (TF-IDF, Logistic Regression, LinearSVC) |
| Deep Learning | PyTorch 2.x |
| NLP | HuggingFace Transformers, GloVe Embeddings |
| Pre-trained Model | cardiffnlp/twitter-roberta-base |
| Visualisation | Matplotlib, Seaborn |
| Hardware | NVIDIA RTX 4060 (CUDA) |

References

  1. Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F.M., Rosso, P. & Sanguinetti, M. (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63.

  2. Indurthi, V., Syed, B., Shrivastava, M., Chakravartula, N., Guber, M. & Varma, V. (2019). FERMI at SemEval-2019 Task 5: Using Sentence Embeddings to Identify Hate Speech Against Immigrants and Women in Twitter. SemEval-2019.

  3. Barbieri, F., Camacho-Collados, J., Neves, L. & Espinosa-Anke, L. (2020). TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. Findings of EMNLP 2020.

  4. Pennington, J., Socher, R. & Manning, C.D. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.


Author

Pedram Ebadollahyvahed — MSc Data Science, Cardiff University (2025–2026)

GitHub · LinkedIn


License

This project is licensed under the MIT License — see the LICENSE file for details.
