
Add MediaWiki diff enrichment worker service #7

Open
artvandelay wants to merge 1 commit into main from codex/create-worker-service-for-mediawiki-diffs

@artvandelay (Owner)

Motivation

  • Add a background worker to consume stored events and enrich them with MediaWiki diffs and page metadata for downstream analysis and political scoring.
  • Extract structured added/removed text and surrounding context from MediaWiki HTML diffs for reliable artifact storage.
  • Cache page-level metadata (categories and Wikidata QID) with TTL to avoid repeated API calls.
  • Handle API rate limits and transient errors with retry/backoff to improve robustness.
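The retry/backoff behavior named in the last point can be illustrated with a minimal sketch. This is not the PR's HttpClient: the names call_with_backoff and TransientError, the default of four attempts, and the one-second base delay are all assumptions made for illustration. The sleep function is injectable so the delays can be observed without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 5xx)."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,          # assumed default, not taken from the PR
    base_delay: float = 1.0,        # assumed default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientError with delays of base_delay * 2**n."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each retry: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

A production client would probably also honor a Retry-After header when the API sends one and add jitter to the delays; both are left out here to keep the sketch short.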

Description

  • Add a new worker EnrichWorker in services/enrich/worker.py that fetches pending events, obtains diffs with action=compare, parses HTML, persists diff records, and triggers metadata fetching via MetadataFetcher.
  • Implement parse_diff_html in services/enrich/diff_parser.py which converts MediaWiki HTML diffs into DiffFragment objects containing added_text, removed_text, and context.
  • Provide SQLite-backed storage DiffStorage in services/enrich/storage.py that creates events, diffs, and metadata_cache tables and exposes methods to fetch/mark events, insert diffs, and upsert/get cached metadata.
  • Add HttpClient (retry with exponential backoff) in services/enrich/http_client.py, and MetadataFetcher in services/enrich/metadata.py, which fetches page categories and the wikibase_item QID and caches the results with a TTL.
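To make the diff-parsing step concrete, here is a minimal sketch of how added, removed, and context text might be pulled out of a MediaWiki HTML diff using only the standard library. It is not the PR's parse_diff_html: the function name parse_diff_sketch, the helper class, and the choice to return a single DiffFragment holding lists of strings are assumptions. It relies on the diff-addedline, diff-deletedline, and diff-context classes that MediaWiki puts on diff table cells.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class DiffFragment:
    # Field names follow the PR description; the list-of-strings types are assumed.
    added_text: List[str] = field(default_factory=list)
    removed_text: List[str] = field(default_factory=list)
    context: List[str] = field(default_factory=list)


# Map MediaWiki diff cell classes to DiffFragment fields.
_CELL_FIELDS = {
    "diff-addedline": "added_text",
    "diff-deletedline": "removed_text",
    "diff-context": "context",
}


class _DiffCellParser(HTMLParser):
    """Collects the text of each recognized <td> cell (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()  # convert_charrefs=True decodes entities like &amp;
        self.fragment = DiffFragment()
        self._field: Optional[str] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return  # inline <ins>/<del>/<div> tags are ignored; their text is kept
        classes = (dict(attrs).get("class") or "").split()
        self._field = next(
            (_CELL_FIELDS[c] for c in classes if c in _CELL_FIELDS), None
        )
        self._buffer = []

    def handle_data(self, data):
        if self._field is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._field is not None:
            text = "".join(self._buffer).strip()
            if text:
                getattr(self.fragment, self._field).append(text)
            self._field = None


def parse_diff_sketch(html: str) -> DiffFragment:
    """Parse the table rows of an HTML diff into one DiffFragment."""
    parser = _DiffCellParser()
    parser.feed(html)
    parser.close()
    return parser.fragment
```

This sketch keeps whole changed lines rather than only the inline ins/del spans, and it does not pair each change with its neighboring context; the actual parser may well do both.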

Testing

  • Ran python -m unittest tests/test_diff_parser.py, which exercises the parse_diff_html parsing scenarios.
  • The run executed 1 unit test and passed (OK).

Codex Task
