genderTagger

A playful RSE tool for gender annotation in Digital Humanities projects.

genderTagger is a web-based application for gender classification in TEI-XML person registers. It combines multiple automated classification methods with a gamified web interface for human annotation. It is developed by TELOTA's KI-Lab and the Gender & Data initiative at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).

Status: The current codebase contains a Streamlit-based proof of concept and standalone implementations of all classification methods used for initial testing and benchmarking. The project is now being refactored into a production architecture with a FastAPI backend and a Vue.js frontend.

Overview

Person registers are the backbone of digital scholarly editions — they link actors, create context, and open perspectives on social and intellectual networks. Yet gender information is often missing, making the Gender Data Gap invisible. genderTagger addresses this by providing:

  • 9 classification methods ranging from authority file lookups to transformer models and local LLMs
  • A gamified web UI with browse/annotate mode, a timed game mode with scoring and leaderboards, and statistics dashboards
  • TEI-XML import/export so enriched data flows back into existing edition workflows
  • Benchmark tooling to evaluate and compare classifier performance

The primary development case study is schleiermacher digital, whose person register contains ~8,000 entries.
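
As a rough illustration of the TEI-XML round trip, the sketch below reads a tiny person register with Python's standard library and writes a label back as a `<sex>` element. The sample XML, the helper names `read_persons` and `enrich_register`, and the choice of `<sex value="...">` as the output encoding are assumptions for illustration; the project's actual parser and export format may differ.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)  # serialise TEI elements without a prefix

# Hypothetical miniature register; real registers are far larger.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><listPerson>
    <person xml:id="P1"><persName><forename>Henriette</forename> <surname>Herz</surname></persName></person>
    <person xml:id="P2"><persName><forename>Friedrich</forename> <surname>Schlegel</surname></persName></person>
  </listPerson></body></text>
</TEI>"""


def read_persons(root):
    """Yield (xml:id, forename) for every <person> in the register."""
    for person in root.iter(f"{{{TEI_NS}}}person"):
        forename = person.find(f".//{{{TEI_NS}}}forename")
        yield person.get(XML_ID), (forename.text if forename is not None else None)


def enrich_register(root, labels):
    """Add <sex value="..."> to each person that has a label and no <sex> yet."""
    for person in root.iter(f"{{{TEI_NS}}}person"):
        label = labels.get(person.get(XML_ID))
        if label and person.find(f"{{{TEI_NS}}}sex") is None:
            ET.SubElement(person, f"{{{TEI_NS}}}sex", {"value": label})
    return root


root = ET.fromstring(SAMPLE)
persons = list(read_persons(root))
enriched = ET.tostring(enrich_register(root, {"P1": "F", "P2": "M"}), encoding="unicode")
```

Existing `<sex>` elements are left untouched, so a manual annotation already present in the register is never overwritten by an automated label.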

Quick Start

  1. Install dependencies:

    pip install -r requirements.txt

  2. Launch the prototype app:

    streamlit run src/gendertagger/prototype/app.py

  3. Import your data: go to Settings → Import Person Register (default path: data/raw/schleiermacher/Register/Personen)

  4. Start annotating: use Browse & Annotate for systematic work or Game Mode for speed annotation with scoring.

See docs/QUICK_START.md for the full quick-start guide and docs/UI_GUIDE.md for detailed usage instructions.

Classification Methods

genderTagger ships with nine independent classification approaches:

| Method | Type | Source / Model |
|---|---|---|
| GND API | Authority file lookup | lobid-gnd (Gemeinsame Normdatei) |
| GND Local Lookup | Authority file (offline) | Pre-built SQLite DB from GND MARC XML |
| gender-guesser | Rule-based heuristic | gender-guesser library |
| nomquamgender | Statistical ML | nomquamgender library |
| HF Gender-Classification | DistilBERT transformer | padmajabfrl/Gender-Classification |
| HF Genderize | BERT transformer | imranali291/genderize |
| HIVOTO | Historical name lookup | HIVOTO v1.0.0 (Historisches Vornamentool) |
| HIVOTO-XGBoost | XGBoost + char n-gram TF-IDF | Trained on HIVOTO dataset |
| LLM (Ollama) | Local large language model | Configurable model via Ollama REST API |

Five of these classifiers (gender-guesser, nomquamgender, HF Gender-Classification, GND API, and optionally Ollama LLM) are integrated directly into the web UI to assist human annotators.

All nine methods were first developed and tested as standalone scripts (in methods/) to validate their accuracy and coverage independently before integration.
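
To make the LLM method more concrete, here is a minimal sketch of a call to Ollama's `/api/generate` endpoint that asks for a JSON answer and maps it onto the four annotation labels. The prompt wording, the placeholder model name `llama3`, and the helper names are assumptions for illustration, not the project's actual prompts or schemas.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
ALLOWED = {"Male", "Female", "Institution", "Uncertain"}


def build_payload(name, model="llama3"):
    """Build a non-streaming request body; model name and prompt are placeholders."""
    prompt = (
        "Classify the historical person register entry below. "
        'Answer as JSON: {"label": "Male|Female|Institution|Uncertain"}.\n'
        f"Entry: {name}"
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_label(response_body):
    """Extract the label from a reply; fall back to 'Uncertain' on anything unexpected."""
    try:
        label = json.loads(response_body["response"]).get("label")
        return label if label in ALLOWED else "Uncertain"
    except (KeyError, TypeError, ValueError, AttributeError):
        return "Uncertain"


def classify(name, model="llama3"):
    """Send one request to a locally running Ollama server (requires network access)."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(name, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return parse_label(json.load(reply))
```

Falling back to Uncertain on malformed output keeps a misbehaving model from writing labels outside the annotation scheme.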

Architecture

Current: Streamlit Prototype

A Streamlit proof-of-concept (src/gendertagger/prototype/) was built to explore the annotation workflow and gamification features. It offers five modes:

  • Browse & Annotate — filter by status, gender, or name; view classifier predictions per person; annotate with one click (Male / Female / Institution / Uncertain)
  • Game Mode — timed annotation with score tracking, streak bonuses, and three difficulty levels (Easy: 10, Medium: 20, Hard: 50 questions)
  • Statistics — annotation progress and gender distribution charts
  • Leaderboard — top 10 scores for team competition
  • Settings — TEI-XML import, CSV/JSON export, classifier management
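
The Game Mode mechanics can be pictured with a small sketch like the one below. Only the round lengths (10, 20, 50 questions) come from the description above; the point values, the streak formula, and the `GameRound` class are hypothetical and do not reflect the prototype's actual scoring rules.

```python
from dataclasses import dataclass

# Round lengths as listed above.
QUESTIONS_PER_ROUND = {"easy": 10, "medium": 20, "hard": 50}

# Hypothetical scoring constants; the prototype's real values may differ.
BASE_POINTS = 10
STREAK_BONUS = 2       # extra points per preceding consecutive annotation
MAX_STREAK_BONUS = 10  # cap so long streaks do not dominate the score


@dataclass
class GameRound:
    difficulty: str
    score: int = 0
    streak: int = 0
    answered: int = 0

    @property
    def finished(self):
        return self.answered >= QUESTIONS_PER_ROUND[self.difficulty]

    def answer(self, annotated):
        """Record one question; `annotated` is False when skipped or timed out."""
        if self.finished:
            raise ValueError("round is already complete")
        self.answered += 1
        if annotated:
            bonus = min(self.streak * STREAK_BONUS, MAX_STREAK_BONUS)
            self.score += BASE_POINTS + bonus
            self.streak += 1
        else:
            self.streak = 0  # a skip or timeout resets the streak
        return self.score


round_ = GameRound("easy")
for annotated in [True, True, False, True]:
    round_.answer(annotated)
```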

Planned: FastAPI + Vue.js

The project is being refactored into a decoupled client-server architecture:

  • Backend (src/gendertagger/backend/) — a FastAPI REST API exposing classification endpoints, TEI-XML import/export, annotation management, and user/session handling. All nine classifiers will be accessible through a unified API.
  • Frontend (src/gendertagger/frontend/) — a Vue.js single-page application providing the gamified annotation UI, statistics dashboards, and team leaderboards.

This separation enables independent scaling, easier testing, and a richer interactive frontend.
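
One way to picture a unified classifier API is a registry of callables that all return a label from the same scheme. The sketch below is framework-free and purely illustrative: the registry, the decorator, the `toy-lookup` stand-in, and the decision to map gender-guesser's `mostly_*` and `andy` results to Uncertain are assumptions, not the planned backend design.

```python
from typing import Callable, Dict

LABELS = ("Male", "Female", "Institution", "Uncertain")

# gender-guesser's return values mapped onto the annotation scheme.
# The mapping itself is an assumption made for this sketch.
GENDER_GUESSER_MAP = {
    "male": "Male",
    "female": "Female",
    "mostly_male": "Uncertain",
    "mostly_female": "Uncertain",
    "andy": "Uncertain",
    "unknown": "Uncertain",
}

Classifier = Callable[[str], str]
REGISTRY: Dict[str, Classifier] = {}


def register(name):
    """Decorator that adds a classifier function to the registry under `name`."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("toy-lookup")
def toy_lookup(forename):
    """Stand-in for a real method: a two-entry lookup table."""
    raw = {"henriette": "female", "friedrich": "male"}.get(forename.lower(), "unknown")
    return GENDER_GUESSER_MAP[raw]


def classify_all(forename):
    """Run every registered classifier and return {method name: label}."""
    return {name: func(forename) for name, func in REGISTRY.items()}
```

With every method behind the same signature, an API endpoint could return `classify_all(...)` directly and the frontend would not need to know how many classifiers exist.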

Project Structure

gendertagger/
├── data/
│   ├── output/                         # Outputs and annotation database
│   │   ├── annotations.db              # SQLite annotation database
│   │   └── results/                    # Classification results (9 CSVs)
│   ├── processed/                      # Processed datasets (e.g. persons CSV)
│   └── raw/                            # Raw source data
│       ├── GND/                        # GND authority MARC XML
│       ├── HIVOTO/                     # HIVOTO v1.0.0 dataset
│       ├── schleiermacher/             # TEI-XML edition data
│       │   ├── Briefe/                 # Letters (1774–1834)
│       │   ├── Register/               # Person/place/work registers
│       │   └── Thesaurus/              # SKOS/RDF thesaurus
│       └── wiki_gendersort/            # Wiki-Gendersort name database
├── docs/                               # Documentation
│   └── references/                     # Reference docs for tools and APIs
├── methods/                            # 9 standalone classification scripts (test implementations)
├── notebooks/                          # Jupyter notebooks for analysis
├── scripts/                            # Benchmark, agreement and poster scripts
├── src/
│   └── gendertagger/                   # Main Python package
│       ├── backend/                    # Backend logic
│       │   ├── classification/         # Classifier implementations
│       │   ├── config/                 # Configuration
│       │   ├── data/                   # TEI-XML parsing and enrichment
│       │   ├── llm/                    # LLM integration (prompts, schemas)
│       │   ├── transformations/        # XSLT transformations
│       │   └── utils/                  # Utility functions
│       ├── frontend/                   # Vue.js frontend (planned)
│       └── prototype/                  # Streamlit proof-of-concept app
├── pyproject.toml
├── requirements.txt
├── requirements-dev.txt
├── LICENSE                             # MIT
└── README.md

Benchmark and Evaluation

The scripts/ directory provides tools for systematic evaluation:

  • benchmark.py — compares all nine methods against GND as ground truth; computes coverage, classification distributions, and accuracy metrics
  • agreements.py — inter-annotator agreement: pairwise Cohen's κ, Fleiss' κ, Krippendorff's α
  • coverage.py — coverage and classification distribution report across three tiers (binary, substantive, overall)

Poster visualisation scripts (poster_benchmark_chart.py, poster_pies.py, poster_example_badges.py) generate figures for presentations.
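
For orientation, the core metrics named above can be written in a few lines of plain Python. This is a minimal sketch with made-up toy data, not the code in benchmark.py or agreements.py; in particular, treating `None` as "no answer" and computing accuracy only over labelled entries are assumptions.

```python
from collections import Counter


def coverage(predictions):
    """Share of entries for which the method returned any label (None = no answer)."""
    return sum(p is not None for p in predictions) / len(predictions)


def accuracy(predictions, truth):
    """Accuracy over the entries the method actually labelled."""
    pairs = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    return sum(p == t for p, t in pairs) / len(pairs)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when both raters use a single identical label.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Toy data for illustration only.
truth = ["F", "M", "M", "F"]
method = ["F", "M", None, "M"]
rater_1 = ["F", "M", "M", "F"]
rater_2 = ["F", "M", "F", "F"]
```

Reporting coverage next to accuracy matters because a method that answers only for easy names can look accurate while leaving most of the register unlabelled.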

Installation

Requirements

  • Python 3.10+
  • Core packages: streamlit, pandas, plotly, gender-guesser, nomquamgender, transformers, torch, requests

Setup

  1. Clone the repository:

    git clone https://github.com/telota/gendertagger.git
    cd gendertagger

  2. Install dependencies:

    pip install -r requirements.txt

  3. Launch the Streamlit prototype:

    streamlit run src/gendertagger/prototype/app.py

For the benchmark scripts, additional packages may be required: xgboost, scikit-learn, krippendorff, statsmodels, seaborn, matplotlib.

Limitations

  • Gender classification is based on historical names and uses a binary scheme (Male / Female) plus the categories Uncertain and Institution. Non-binary, trans*, or agender categories are not currently supported.
  • Name-based classification has inherent limitations and cultural biases.
  • Each classification is a statistical estimate and does not reflect a person's individual gender identity.
  • Results require validation, especially for ambiguous or culturally diverse names.

Research Context

This work was presented at deRSE26 (German Conference on Research Software Engineering, University of Stuttgart, March 3–5, 2026). The submitted abstract is available here.

Contributing

This project is part of ongoing research. For questions or collaboration inquiries, please open an issue or contact the project team.

License

This project is licensed under the MIT License.

Copyright (c) 2026 TELOTA – The Electronic Life Of The Academy.

Acknowledgments

Developed by the KI-Lab and Gender & Data initiative at TELOTA, Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).
