wmde · philippesaade-wmde · Apr 3, 2026
diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
@@ -0,0 +1,26 @@
+name: "Ruff Lint"
+
+on:
+  pull_request:
+    branches: ["main"]
+
+permissions:
+  contents: read
+
+jobs:
+  ruff:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Set up Python
+        run: uv python install
+
+      - name: Run Ruff linter
+        run: uv run ruff check .
+
+      - name: Run Ruff formatter check
+        run: uv run ruff format --check .
diff --git a/README.md b/README.md
@@ -1,81 +1,47 @@
 # Wikidata Textifier
 
-**Wikidata Textifier** is an API that transforms Wikidata items into compact format for use in LLMs and GenAI applications. It resolves missing labels of properties and claim values by querying the Wikidata Action API, making it efficient and suitable for AI pipelines.
+**Wikidata Textifier** is an API that transforms Wikidata entities into compact outputs for LLM and GenAI use cases.
+It resolves missing labels for properties and claim values using the Wikidata Action API and caches labels to reduce repeated lookups.
 
-🔗 Live API: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
+Live API: [wd-textify.wmcloud.org](https://wd-textify.wmcloud.org/) \
+API Docs: [wd-textify.wmcloud.org/docs](https://wd-textify.wmcloud.org/docs)
 
----
+## Features
 
-## Functionalities
+- Textify Wikidata entities as `json`, `text`, or `triplet`.
+- Resolve labels for linked entities and properties.
+- Cache labels in MariaDB for faster repeated requests.
+- Support multilingual output with a configurable fallback language.
+- Avoid SPARQL; use the Wikidata Action API and EntityData endpoints instead.
 
-- **Textifies** any Wikidata item into a readable or JSON format suitable for LLMs.
-- **Resolves all labels**, including those missing when querying the Wikidata API.
-- **Caches labels** for 90 days to boost performance and reduce API load.
-- **Avoids SPARQL** and uses the Wikidata Action API for better efficiency and compatibility.
-- **Hosted on Toolforge**: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
+## Output Formats
 
----
+- `json`: Structured representation with claims (and optionally qualifiers/references).
+- `text`: Readable summary including label, description, aliases, and attributes.
+- `triplet`: Triplet-style lines with labels and IDs for graph-style traversal.
 
-## Formats
-
-- **Text**: A textual representation or summary of the Wikidata item, including its label, description, aliases, and claims. Useful for helping LLMs understand what the item represents.
-- **Triplet**: Outputs each triplet as a structured line, including labels and IDs, but omits descriptions and aliases. Ideal for agentic LLMs to traverse and explore Wikidata.
-- **JSON**: A structured and compact representation of the full item, suitable for custom formats.
-
----
-
-## API Usage
+## API
 
 ### `GET /`
 
-#### Query Parameters
-
-| Name           | Type    | Required | Description                                                                 |
-|----------------|---------|----------|-----------------------------------------------------------------------------|
-| `id`           | string  | Yes      | Wikidata item ID (e.g., `Q42`)                                              |
-| `lang`         | string  | No       | Language code for labels (default: `en`)                                   |
-| `format`         | string    | No       | The format of the response, either 'json', 'text', or 'triplet' (default: `json`) |
-| `external_ids` | bool    | No       | Whether to include external IDs in the output (default: `true`)            |
-| `all_ranks` | bool    | No       | If false, returns ranked preferred statements, falling back to normal when unavailable (default: `false`)            |
-| `references` | bool    | No       | Whether to include references (default: `false`)            |
-| `fallback_lang` | string    | No       | Fallback language code if the preferred language is not available (default: `en`)            |
-
----
-
-## Deploy to Toolforge
-
-1. Shell into the Toolforge system:
-
-```bash
-ssh [UNIX shell username]@login.toolforge.org
-```
-
-2. Switch to tool user account:
-
-```bash
-become wd-textify
-```
-
-3. Build from Git:
-
-```bash
-toolforge build start https://github.com/philippesaade-wmde/WikidataTextifier.git
-```
+#### Query parameters
 
-4. Start the web service:
+| Name | Type | Required | Description |
+|---|---|---|---|
+| `id` | string | Yes | Comma-separated Wikidata IDs (for example: `Q42` or `Q42,Q2`). |
+| `pid` | string | No | Comma-separated property IDs to filter claims (for example: `P31,P279`). |
+| `lang` | string | No | Preferred language code (default: `en`). |
+| `fallback_lang` | string | No | Fallback language code (default: `en`). |
+| `format` | string | No | Output format: `json`, `text`, or `triplet` (default: `json`). |
+| `external_ids` | bool | No | Include `external-id` datatype claims (default: `true`). |
+| `all_ranks` | bool | No | Include statements of all ranks (preferred, normal, and deprecated). If `false`, return preferred statements, falling back to normal when none are preferred (default: `false`). |
+| `qualifiers` | bool | No | Include qualifiers in claim values (default: `true`). |
+| `references` | bool | No | Include references in claim values; JSON output only (default: `false`). |
 
-```bash
-webservice buildservice start --mount all
-```
-
-5. Debugging the web service:
-
-Read the logs:
-```bash
-webservice logs
-```
+#### Example requests
 
-Open the service shell:
 ```bash
-webservice shell
+curl "https://wd-textify.wmcloud.org/?id=Q42"
+curl "https://wd-textify.wmcloud.org/?id=Q42&format=text&lang=en"
+curl "https://wd-textify.wmcloud.org/?id=Q42,Q2&pid=P31,P279&format=triplet"
 ```
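
For readers who call the endpoint from Python rather than curl, the parameter table above can be turned into a small URL builder. This is a minimal sketch, not part of the PR: `build_textify_url` and its argument names are hypothetical, and only the base URL and the query parameter names come from the README.

```python
from urllib.parse import urlencode

BASE_URL = "https://wd-textify.wmcloud.org/"


def build_textify_url(ids, pids=None, fmt="json", lang="en", **flags):
    """Build a request URL for the GET / endpoint (hypothetical helper)."""
    if fmt not in {"json", "text", "triplet"}:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"id": ",".join(ids), "format": fmt, "lang": lang}
    if pids:
        params["pid"] = ",".join(pids)
    # Boolean flags such as qualifiers/references are sent as lowercase strings.
    for name, value in flags.items():
        params[name] = str(value).lower() if isinstance(value, bool) else value
    # Keep commas unescaped so the URL looks like the curl examples.
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_textify_url(["Q42", "Q2"], pids=["P31", "P279"], fmt="triplet"))
```

The resulting string can be passed to any HTTP client; the sketch deliberately makes no network call.
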
diff --git a/main.py b/main.py
@@ -1,14 +1,16 @@
-from fastapi import FastAPI, HTTPException, Query, Request
-from fastapi.middleware.cors import CORSMiddleware
-from fastapi import BackgroundTasks
+"""FastAPI application that exposes Wikidata textification endpoints."""
+
+import os
+import time
 import traceback
+
 import requests
-import time
-import os
+from fastapi import BackgroundTasks, FastAPI, HTTPException, Query, Request
+from fastapi.middleware.cors import CORSMiddleware
 
-from src.Normalizer import TTLNormalizer, JSONNormalizer
-from src.WikidataLabel import WikidataLabel, LazyLabelFactory
 from src import utils
+from src.Normalizer import JSONNormalizer, TTLNormalizer
+from src.WikidataLabel import LazyLabelFactory, WikidataLabel
 
 # Start Fastapi app
 app = FastAPI(
@@ -32,68 +34,81 @@
 LABEL_CLEANUP_INTERVAL_SECONDS = int(os.environ.get("LABEL_CLEANUP_INTERVAL_SECONDS", 3600))
 _last_label_cleanup = 0.0
 
+
 @app.on_event("startup")
 async def startup():
+    """Initialize database resources required by the API."""
     WikidataLabel.initialize_database()
 
+
 @app.get(
     "/",
     responses={
         200: {
             "description": "Returns a list of relevant Wikidata property PIDs with similarity scores",
             "content": {
                 "application/json": {
-                    "example": [{
-                        "Q42": "Douglas Adams (human), English writer, humorist, and dramatist...",
-                    }]
+                    "example": [
+                        {
+                            "Q42": "Douglas Adams (human), English writer, humorist, and dramatist...",
+                        }
+                    ]
                 }
             },
         },
         422: {
             "description": "Missing or invalid query parameter",
-            "content": {
-                "application/json": {
-                    "example": {"detail": "Invalid format specified"}
-                }
-            },
+            "content": {"application/json": {"example": {"detail": "Invalid format specified"}}},
         },
     },
 )
 async def get_textified_wd(
-    request: Request, background_tasks: BackgroundTasks,
+    request: Request,
+    background_tasks: BackgroundTasks,
     id: str = Query(..., examples="Q42,Q2"),
     pid: str = Query(None, examples="P31,P279"),
-    lang: str = 'en',
-    format: str = 'json',
+    lang: str = "en",
+    format: str = "json",
     external_ids: bool = True,
     references: bool = False,
     all_ranks: bool = False,
     qualifiers: bool = True,
-    fallback_lang: str = 'en'
+    fallback_lang: str = "en",
 ):
-    """
-    Retrieve a Wikidata item with all labels or textual representations for an LLM.
-
-    Args:
-        id (str): The Wikidata item ID (e.g., "Q42").
-        pid (str): Comma-separated list of property IDs to filter claims (e.g., "P31,P279").
-        format (str): The format of the response, either 'json', 'text', or 'triplet'.
-        lang (str): The language code for labels (default is 'en').
-        external_ids (bool): If True, includes external IDs in the response.
-        all_ranks (bool): If True, includes statements of all ranks (preferred, normal, deprecated).
-        references (bool): If True, includes references in the response. (only available in JSON format)
-        qualifiers (bool): If True, includes qualifiers in the response.
-        fallback_lang (str): The fallback language code if the preferred language is not available.
-
-    Returns:
-        list: A list of dictionaries containing QIDs and the similarity scores.
+    """Retrieve Wikidata entities as structured JSON, natural text, or triplet lines.
+
+    This endpoint fetches one or more entities, resolves missing labels, and normalizes
+    claims into a compact representation suitable for downstream LLM use.
+
+    **Args:**
+
+    - **id** (str): Comma-separated Wikidata IDs to fetch (for example: `"Q42"` or `"Q42,Q2"`).
+    - **pid** (str, optional): Comma-separated property IDs used to filter returned claims (for example: `"P31,P279"`).
+    - **lang** (str): Preferred language code for labels and formatted values.
+    - **format** (str): Output format. One of `"json"`, `"text"`, or `"triplet"`.
+    - **external_ids** (bool): If `true`, include claims with datatype `external-id`.
+    - **references** (bool): If `true`, include references in claim values (JSON output only).
+    - **all_ranks** (bool): If `true`, include preferred, normal, and deprecated statement ranks.
+    - **qualifiers** (bool): If `true`, include qualifiers for claim values.
+    - **fallback_lang** (str): Fallback language used when `lang` is unavailable.
+    - **request** (Request): FastAPI request context object.
+    - **background_tasks** (BackgroundTasks): Background task manager used for cache cleanup.
+
+    **Returns:**
+
+    A dictionary keyed by requested entity ID (for example, `"Q42"`).
+    Each value depends on `format`:
+
+    - **json**: Structured entity payload with label, description, aliases, and claims.
+    - **text**: Human-readable summary text.
+    - **triplet**: Triplet-style text lines with labels and IDs.
     """
     try:
         filter_pids = []
         if pid:
-            filter_pids = [p.strip() for p in pid.split(',')]
+            filter_pids = [p.strip() for p in pid.split(",")]
 
-        qids = [q.strip() for q in id.split(',')]
+        qids = [q.strip() for q in id.split(",")]
         label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang)
 
         entities = {}
@@ -144,7 +159,9 @@ async def get_textified_wd(
                     fallback_lang=fallback_lang,
                     label_factory=label_factory,
                     debug=False,
-                ) if entity_data.get(qid) else None
+                )
+                if entity_data.get(qid)
+                else None
                 for qid in qids
             }
 
@@ -154,8 +171,10 @@ async def get_textified_wd(
                     all_ranks=all_ranks,
                     references=references,
                     filter_pids=filter_pids,
-                    qualifiers=qualifiers
-                ) if entity else None
+                    qualifiers=qualifiers,
+                )
+                if entity
+                else None
                 for qid, entity in entity_data.items()
             }
 
@@ -165,9 +184,9 @@ async def get_textified_wd(
                 return_data[qid] = None
                 continue
 
-            if format == 'text':
+            if format == "text":
                 results = entity.to_text(lang)
-            elif format == 'triplet':
+            elif format == "triplet":
                 results = entity.to_triplet()
             else:
                 results = entity.to_json()
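
The handler's input parsing and format dispatch can be exercised in isolation. The sketch below mirrors only what is visible in the hunks above (comma splitting with `strip()`, and `text` / `triplet` / fall-through to JSON). `FakeEntity`, `split_ids`, and `render` are hypothetical names introduced for illustration; the real entity classes live in `src.Normalizer` and are not shown in this diff.

```python
def split_ids(raw):
    """Split a comma-separated ID string and strip whitespace.

    Empty or missing input yields an empty list, as the handler does for `pid`.
    """
    return [part.strip() for part in raw.split(",")] if raw else []


class FakeEntity:
    """Hypothetical stand-in for a normalized entity object."""

    def to_text(self, lang):
        return f"text[{lang}]"

    def to_triplet(self):
        return "triplet"

    def to_json(self):
        return {"kind": "json"}


def render(entity, fmt, lang="en"):
    """Mirror the format dispatch in the diff; a missing entity stays None."""
    if entity is None:
        return None
    if fmt == "text":
        return entity.to_text(lang)
    if fmt == "triplet":
        return entity.to_triplet()
    # Any other value falls through to JSON, as in the visible else branch.
    return entity.to_json()


entities = {qid: FakeEntity() for qid in split_ids("Q42, Q2")}
print({qid: render(entity, "text", "en") for qid, entity in entities.items()})
```

Note that, in the visible branch, an unrecognized `format` value falls through to JSON; whether it is rejected earlier with a 422 is not shown in these hunks.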

diff --git a/pyproject.toml b/pyproject.toml
@@ -13,3 +13,30 @@ dependencies = [
     "sqlalchemy>=2.0.41",
     "uvicorn>=0.35.0",
 ]
+
+[dependency-groups]
+dev = [
+    "ruff>=0.9.0"
+]
+
+[tool.ruff]
+target-version = "py313"
+line-length = 120
+
+exclude = ["data/mysql"]
+
+[tool.ruff.lint]
+select = [
+    "E",   # pycodestyle errors
+    "F",   # Pyflakes (catches undefined names, unused imports, etc.)
+    "I",   # isort (import sorting)
+    "D",   # pydocstyle (function/class documentation)
+]
+
+[tool.ruff.lint.pydocstyle]
+convention = "google"
+
+[tool.ruff.lint.isort]
+known-first-party = [
+    "src"
+]
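
With the `D` rules selected and `convention = "google"`, public modules and functions are expected to carry docstrings with a one-line summary followed by `Args:` and `Returns:` sections. As a rough illustration of that shape, here is a small example; `cleanup_due` is a hypothetical helper, loosely inspired by `LABEL_CLEANUP_INTERVAL_SECONDS` in `main.py`, and is not taken from the repository.

```python
"""Example module showing a Google-style docstring."""


def cleanup_due(last_cleanup: float, now: float, interval_seconds: int = 3600) -> bool:
    """Report whether the label cache cleanup interval has elapsed.

    Args:
        last_cleanup: Timestamp of the previous cleanup, in seconds.
        now: Current timestamp, in seconds.
        interval_seconds: Minimum gap between cleanups.

    Returns:
        True if at least ``interval_seconds`` have passed since ``last_cleanup``.
    """
    return now - last_cleanup >= interval_seconds


print(cleanup_due(0.0, 7200.0))
```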