Detecting hate speech against women and immigrants using NLP and deep learning — with an honest look at why models fail.
This project builds and compares four progressively sophisticated models for binary hate speech classification on Twitter, using the SemEval-2019 Task 5 (HatEval) benchmark — one of the most widely cited hate speech datasets in NLP research.
Beyond model building, this project investigates a critical real-world challenge: why hate speech detectors systematically fail when profanity is used casually rather than hatefully. The analysis reveals that models learn lexical shortcuts (profanity → hate) that break down under distribution shift — a finding consistent with the original competition results, where even the winning team achieved only 0.651 macro F1 on the adversarial test set.
The HatEval dataset consists of 12,971 English tweets collected from Twitter, annotated for hate speech directed at women or immigrants:
| Split | Total | Not Hate | Hate | Hate % |
|---|---|---|---|---|
| Train | 9,000 | 5,217 | 3,783 | 42.0% |
| Dev | 1,000 | 573 | 427 | 42.7% |
| Test | 2,971 | 1,719 | 1,252 | 42.1% |
Each tweet includes three labels: HS (hate speech), TR (individual vs group target), and AG (aggressiveness). The primary task uses the HS label for binary classification.
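Assuming the standard HatEval TSV layout (tab-separated columns `id`, `text`, `HS`, `TR`, `AG`), a split can be loaded with pandas; the inline sample below is a hypothetical stand-in for `data/train_en.tsv`:

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the HatEval TSV layout.
sample_tsv = (
    "id\ttext\tHS\tTR\tAG\n"
    "1\tsome hateful tweet\t1\t0\t1\n"
    "2\tan ordinary tweet\t0\t0\t0\n"
)

# In the project this would be: pd.read_csv("data/train_en.tsv", sep="\t")
train = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# HS is the binary hate-speech label used for the primary task.
X, y = train["text"], train["HS"]
```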
Citation: Basile, V., Bosco, C., Fersini, E., et al. (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63.
The project follows a deliberate four-tier progression — from simple bag-of-words to transfer learning — demonstrating the strengths and limitations of each approach:
Standard bag-of-words approach using unigram and bigram TF-IDF features (10K vocabulary) with logistic regression. Establishes a solid starting point.
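A minimal scikit-learn sketch of this baseline (the 10K vocabulary cap matches the description above; the toy training data and remaining hyperparameters are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Unigram + bigram TF-IDF capped at a 10K vocabulary, fed to logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=10_000),
    LogisticRegression(max_iter=1000),
)

# Toy stand-in data; the project fits on the 9,000 HatEval training tweets.
texts = ["I hate all of them", "lovely weather today", "go back home", "great game tonight"]
labels = [1, 0, 1, 0]
baseline.fit(texts, labels)
```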
Combines 30K word-level trigram features with 30K character-level n-grams to capture sub-word patterns and spelling variations. Uses class-weighted logistic regression to address label imbalance.
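The word+char combination can be sketched with a `FeatureUnion`; the 30K caps mirror the description, while the character n-gram range of (2, 5) is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline

# Word-level n-grams up to trigrams plus character n-grams, 30K features each.
features = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 3), max_features=30_000)),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), max_features=30_000)),
])

# class_weight="balanced" reweights the ~42/58 hate / not-hate split.
model = make_pipeline(features, LogisticRegression(class_weight="balanced", max_iter=1000))

# Toy stand-in data; note the char n-grams can catch spellings like "b!tch".
texts = ["u r a b!tch", "what a beautiful day", "send them back", "nice goal"]
labels = [1, 0, 1, 0]
model.fit(texts, labels)
```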
A bidirectional LSTM with pre-trained GloVe Twitter embeddings (100d, trained on 27B tweets) and a self-attention mechanism. Despite sequence modelling capability, the model overfits on 9K training samples — demonstrating why pre-trained representations are essential for small-dataset NLP.
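The architecture can be sketched in PyTorch as follows; the hidden size and the exact attention formulation are assumptions, and the random embeddings stand in for the GloVe initialisation:

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Sketch of Model 3: BiLSTM over (GloVe-initialised) embeddings
    with additive self-attention pooling. Sizes are illustrative."""

    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        # In the project the weights would be loaded from
        # glove.twitter.27B.100d.txt; here they are randomly initialised.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embedding(token_ids))   # (B, T, 2H)
        weights = torch.softmax(self.attn(out), dim=1)  # attention over T
        pooled = (weights * out).sum(dim=1)             # (B, 2H)
        return self.classifier(pooled)                  # (B, 2) logits

model = BiLSTMAttention(vocab_size=5000)
logits = model(torch.randint(1, 5000, (4, 20)))  # batch of 4 sequences, length 20
```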
CardiffNLP's twitter-roberta-base, pre-trained on ~58M tweets, fine-tuned for 3 epochs. Achieves the best performance across all metrics.
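The fine-tuning recipe can be sketched with the HuggingFace `Trainer` API; only the model name and the 3-epoch budget come from the description above, the remaining hyperparameters are illustrative, and the heavy imports are deferred into the function so the sketch stays importable:

```python
def fine_tune(train_dataset, eval_dataset, output_dir="roberta-hateval"):
    """Sketch of Model 4: fine-tune cardiffnlp/twitter-roberta-base for 3 epochs.
    Datasets are assumed already tokenised with the matching tokenizer."""
    # Deferred imports keep this sketch importable without transformers installed.
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    name = "cardiffnlp/twitter-roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # stated in the project description
        per_device_train_batch_size=16,  # illustrative
        learning_rate=2e-5,              # illustrative
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset,
                      tokenizer=tokenizer)
    trainer.train()
    return trainer
```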
| # | Model | Dev F1 | Test F1 |
|---|---|---|---|
| 1 | TF-IDF + Logistic Regression | 0.725 | 0.488 |
| 2 | Word+Char N-grams + Tuned LR | 0.735 | 0.398 |
| 3 | BiLSTM + GloVe + Attention | 0.744 | 0.473 |
| 4 | Twitter-RoBERTa (fine-tuned) | 0.812 | 0.535 |
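The F1 values here follow the competition's macro-averaged convention (assumed to match the project's evaluation): per-class F1 is computed for "hate" and "not hate" and averaged, so the minority class counts equally. A minimal sketch with toy labels:

```python
from sklearn.metrics import f1_score

# Toy labels: macro F1 averages the per-class F1 of "hate" (1) and "not hate" (0).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
macro_f1 = f1_score(y_true, y_pred, average="macro")
```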
The dramatic dev-to-test performance drop is not a bug — it's a known property of this benchmark. In the original SemEval-2019 competition (74 teams, 108 submissions):
| Team | Test Macro F1 | Method |
|---|---|---|
| Fermi (1st place) | 0.651 | SVM + Universal Sentence Encoder |
| Panaetius (2nd) | 0.571 | Undisclosed |
| 3rd–5th place | 0.519–0.535 | CNNs, LSTMs |
| This project | 0.535 | Twitter-RoBERTa (fine-tuned) |
Our fine-tuned transformer achieves performance comparable to the top 5 competition entries, despite using a straightforward fine-tuning approach without competition-specific optimisations.
The most important insight from this project is why all models fail on the test set. Investigation reveals a deliberate distribution shift in profanity-hate associations:
| Word | Train: % in Hate | Test: % in Hate | Shift |
|---|---|---|---|
| bitch | 78.1% | 43.5% | −34.6pp |
| cunt | 53.6% | 39.1% | −14.5pp |
| whore | 66.2% | 59.2% | −7.0pp |
What's happening: In training, profanity appears predominantly in hateful tweets. Models learn the shortcut "profanity = hate speech" rather than understanding whether hate is directed at a protected group. The test set deliberately includes casual profanity ("bitch please," "that show is my guilty pleasure") to expose this bias.
The consequence: All models predict 80–92% of test tweets as hateful when only 42% actually are — producing massive false positive rates.
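The per-word shift statistics above can be reproduced with a short pandas sketch; the toy frames below are illustrative stand-ins for the real train/test splits:

```python
import pandas as pd

def hate_rate(df: pd.DataFrame, word: str) -> float:
    """Of the tweets containing `word`, what fraction is labelled hate (HS=1)?"""
    mask = df["text"].str.contains(word, case=False, regex=False)
    return df.loc[mask, "HS"].mean()

# Toy stand-in frames; the project computes this over the real splits.
train = pd.DataFrame({"text": ["u bitch", "bitch please", "hello"], "HS": [1, 1, 0]})
test = pd.DataFrame({"text": ["bitch please", "that bitch show", "bye"], "HS": [1, 0, 0]})

for w in ["bitch"]:
    shift = hate_rate(test, w) - hate_rate(train, w)
    print(f"{w}: train {hate_rate(train, w):.1%}, test {hate_rate(test, w):.1%}, shift {shift:+.1%}")
```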
Why this matters for real-world content moderation: Automated systems that rely on keyword-level patterns will systematically over-flag casual language while potentially missing subtle or implicit hate speech that doesn't use explicit slurs. This is a well-documented challenge in the hate speech detection literature and a key consideration for deploying these models in production.
- Python 3.10+
- NVIDIA GPU with CUDA support (recommended for transformer fine-tuning)
```
git clone https://github.com/pedramebd/hate-speech-detection.git
cd hate-speech-detection
pip install -r requirements.txt
```

```
python demo.py
```

The demo loads the fine-tuned transformer and classifies sample tweets interactively.

```
jupyter notebook hate_speech_detection.ipynb
```

The BiLSTM model requires GloVe Twitter embeddings. Download them separately:

```
wget https://nlp.stanford.edu/data/glove.twitter.27B.zip
unzip glove.twitter.27B.zip
```

Note: the GloVe zip file is ~1.4 GB. Only `glove.twitter.27B.100d.txt` is used.
```
hate-speech-detection/
├── hate_speech_detection.ipynb   # Full pipeline notebook (EDA → modelling → analysis)
├── demo.py                       # Interactive demo script
├── requirements.txt              # Python dependencies
├── README.md
├── LICENSE
├── .gitignore
├── data/
│   ├── train_en.tsv              # Training data (9,000 tweets)
│   ├── dev_en.tsv                # Development data (1,000 tweets)
│   └── test_en.tsv               # Test data (2,971 tweets)
└── figures/
    ├── label_distribution.png
    ├── tweet_length_distribution.png
    ├── top_words_by_class.png
    ├── baseline_confusion_matrix.png
    ├── bilstm_training_curves.png
    ├── model_comparison.png
    ├── final_model_comparison.png
    └── confusion_matrix_comparison.png
```
| Category | Tools |
|---|---|
| Languages | Python 3.10 |
| Classical ML | Scikit-learn (TF-IDF, Logistic Regression, LinearSVC) |
| Deep Learning | PyTorch 2.x |
| NLP | HuggingFace Transformers, GloVe Embeddings |
| Pre-trained Model | cardiffnlp/twitter-roberta-base |
| Visualisation | Matplotlib, Seaborn |
| Hardware | NVIDIA RTX 4060 (CUDA) |
- Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F.M., Rosso, P. & Sanguinetti, M. (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63.
- Indurthi, V., Syed, B., Shrivastava, M., Chakravartula, N., Gupta, M. & Varma, V. (2019). FERMI at SemEval-2019 Task 5: Using Sentence Embeddings to Identify Hate Speech Against Immigrants and Women in Twitter. SemEval-2019.
- Barbieri, F., Camacho-Collados, J., Neves, L. & Espinosa-Anke, L. (2020). TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. Findings of EMNLP 2020.
- Pennington, J., Socher, R. & Manning, C.D. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.
Pedram Ebadollahyvahed — MSc Data Science, Cardiff University (2025–2026)
This project is licensed under the MIT License — see the LICENSE file for details.



