26 changes: 26 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,26 @@
name: "Ruff Lint"

on:
  pull_request:
    branches: ["main"]

permissions:
  contents: read

jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Set up Python
        run: uv python install

      - name: Run Ruff linter
        run: uv run ruff check .

      - name: Run Ruff formatter check
        run: uv run ruff format --check .
96 changes: 31 additions & 65 deletions README.md
@@ -1,81 +1,47 @@
# Wikidata Textifier

**Wikidata Textifier** is an API that transforms Wikidata items into compact format for use in LLMs and GenAI applications. It resolves missing labels of properties and claim values by querying the Wikidata Action API, making it efficient and suitable for AI pipelines.
**Wikidata Textifier** is an API that transforms Wikidata entities into compact outputs for LLM and GenAI use cases.
It resolves missing labels for properties and claim values using the Wikidata Action API and caches labels to reduce repeated lookups.

🔗 Live API: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
Live API: [wd-textify.wmcloud.org](https://wd-textify.wmcloud.org/) \
API Docs: [wd-textify.wmcloud.org/docs](https://wd-textify.wmcloud.org/docs)

---
## Features

## Functionalities
- Textify Wikidata entities as `json`, `text`, or `triplet`.
- Resolve labels for linked entities and properties.
- Cache labels in MariaDB for faster repeated requests.
- Support multilingual output with fallback language support.
- Avoid SPARQL and use Wikidata Action API / EntityData endpoints.

- **Textifies** any Wikidata item into a readable or JSON format suitable for LLMs.
- **Resolves all labels**, including those missing when querying the Wikidata API.
- **Caches labels** for 90 days to boost performance and reduce API load.
- **Avoids SPARQL** and uses the Wikidata Action API for better efficiency and compatibility.
- **Hosted on Toolforge**: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
## Output Formats

---
- `json`: Structured representation with claims (and optionally qualifiers/references).
- `text`: Readable summary including label, description, aliases, and attributes.
- `triplet`: Triplet-style lines with labels and IDs for graph-style traversal.

## Formats

- **Text**: A textual representation or summary of the Wikidata item, including its label, description, aliases, and claims. Useful for helping LLMs understand what the item represents.
- **Triplet**: Outputs each triplet as a structured line, including labels and IDs, but omits descriptions and aliases. Ideal for agentic LLMs to traverse and explore Wikidata.
- **JSON**: A structured and compact representation of the full item, suitable for custom formats.

---

## API Usage
## API

### `GET /`

#### Query Parameters

| Name | Type | Required | Description |
|----------------|---------|----------|-----------------------------------------------------------------------------|
| `id` | string | Yes | Wikidata item ID (e.g., `Q42`) |
| `lang` | string | No | Language code for labels (default: `en`) |
| `format` | string | No | The format of the response, either 'json', 'text', or 'triplet' (default: `json`) |
| `external_ids` | bool | No | Whether to include external IDs in the output (default: `true`) |
| `all_ranks` | bool | No | If false, returns ranked preferred statements, falling back to normal when unavailable (default: `false`) |
| `references` | bool | No | Whether to include references (default: `false`) |
| `fallback_lang` | string | No | Fallback language code if the preferred language is not available (default: `en`) |

---

## Deploy to Toolforge

1. Shell into the Toolforge system:

```bash
ssh [UNIX shell username]@login.toolforge.org
```

2. Switch to tool user account:

```bash
become wd-textify
```

3. Build from Git:

```bash
toolforge build start https://github.com/philippesaade-wmde/WikidataTextifier.git
```
#### Query parameters

4. Start the web service:
| Name | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Comma-separated Wikidata IDs (for example: `Q42` or `Q42,Q2`). |
| `pid` | string | No | Comma-separated property IDs to filter claims (for example: `P31,P279`). |
| `lang` | string | No | Preferred language code (default: `en`). |
| `fallback_lang` | string | No | Fallback language code (default: `en`). |
| `format` | string | No | Output format: `json`, `text`, or `triplet` (default: `json`). |
| `external_ids` | bool | No | Include `external-id` datatype claims (default: `true`). |
| `all_ranks` | bool | No | Include all statement ranks instead of preferred/normal filtering (default: `false`). |
| `qualifiers` | bool | No | Include qualifiers in claim values (default: `true`). |
| `references` | bool | No | Include references in claim values (default: `false`). |

```bash
webservice buildservice start --mount all
```

5. Debugging the web service:

Read the logs:
```bash
webservice logs
```
#### Example requests

Open the service shell:
```bash
webservice shell
curl "https://wd-textify.wmcloud.org/?id=Q42"
curl "https://wd-textify.wmcloud.org/?id=Q42&format=text&lang=en"
curl "https://wd-textify.wmcloud.org/?id=Q42,Q2&pid=P31,P279&format=triplet"
```
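
The example requests above can also be assembled programmatically. The following is a minimal sketch and not part of this PR: `build_textify_url` is a hypothetical helper, and only the base URL and the parameter names are taken from the README table. It builds the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://wd-textify.wmcloud.org/"  # taken from the README above


def build_textify_url(ids, pids=None, fmt="json", lang="en"):
    """Build a request URL for the textifier API (hypothetical helper)."""
    params = {"id": ",".join(ids), "format": fmt, "lang": lang}
    if pids:
        # `pid` is optional and filters the returned claims.
        params["pid"] = ",".join(pids)
    # Keep commas unescaped so the URL matches the README examples.
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_textify_url(["Q42", "Q2"], pids=["P31", "P279"], fmt="triplet"))
```

The resulting string can be passed to any HTTP client; the response is expected to be keyed by the requested entity IDs, as described in the endpoint docstring.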
103 changes: 61 additions & 42 deletions main.py
@@ -1,14 +1,16 @@
from fastapi import FastAPI, HTTPException, Query, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi import BackgroundTasks
"""FastAPI application that exposes Wikidata textification endpoints."""

import os
import time
import traceback

import requests
import time
import os
from fastapi import BackgroundTasks, FastAPI, HTTPException, Query, Request
from fastapi.middleware.cors import CORSMiddleware

from src.Normalizer import TTLNormalizer, JSONNormalizer
from src.WikidataLabel import WikidataLabel, LazyLabelFactory
from src import utils
from src.Normalizer import JSONNormalizer, TTLNormalizer
from src.WikidataLabel import LazyLabelFactory, WikidataLabel

# Start Fastapi app
app = FastAPI(
@@ -32,68 +34,81 @@
LABEL_CLEANUP_INTERVAL_SECONDS = int(os.environ.get("LABEL_CLEANUP_INTERVAL_SECONDS", 3600))
_last_label_cleanup = 0.0


@app.on_event("startup")
async def startup():
"""Initialize database resources required by the API."""
WikidataLabel.initialize_database()


@app.get(
"/",
responses={
200: {
"description": "Returns the requested Wikidata entities in the selected output format",
"content": {
"application/json": {
"example": [{
"Q42": "Douglas Adams (human), English writer, humorist, and dramatist...",
}]
"example": [
{
"Q42": "Douglas Adams (human), English writer, humorist, and dramatist...",
}
]
}
},
},
422: {
"description": "Missing or invalid query parameter",
"content": {
"application/json": {
"example": {"detail": "Invalid format specified"}
}
},
"content": {"application/json": {"example": {"detail": "Invalid format specified"}}},
},
},
)
async def get_textified_wd(
request: Request, background_tasks: BackgroundTasks,
request: Request,
background_tasks: BackgroundTasks,
id: str = Query(..., examples=["Q42,Q2"]),
pid: str | None = Query(None, examples=["P31,P279"]),
lang: str = 'en',
format: str = 'json',
lang: str = "en",
format: str = "json",
external_ids: bool = True,
references: bool = False,
all_ranks: bool = False,
qualifiers: bool = True,
fallback_lang: str = 'en'
fallback_lang: str = "en",
):
"""
Retrieve a Wikidata item with all labels or textual representations for an LLM.

Args:
id (str): The Wikidata item ID (e.g., "Q42").
pid (str): Comma-separated list of property IDs to filter claims (e.g., "P31,P279").
format (str): The format of the response, either 'json', 'text', or 'triplet'.
lang (str): The language code for labels (default is 'en').
external_ids (bool): If True, includes external IDs in the response.
all_ranks (bool): If True, includes statements of all ranks (preferred, normal, deprecated).
references (bool): If True, includes references in the response. (only available in JSON format)
qualifiers (bool): If True, includes qualifiers in the response.
fallback_lang (str): The fallback language code if the preferred language is not available.

Returns:
list: A list of dictionaries containing QIDs and the similarity scores.
"""Retrieve Wikidata entities as structured JSON, natural text, or triplet lines.

This endpoint fetches one or more entities, resolves missing labels, and normalizes
claims into a compact representation suitable for downstream LLM use.

**Args:**

- **id** (str): Comma-separated Wikidata IDs to fetch (for example: `"Q42"` or `"Q42,Q2"`).
- **pid** (str, optional): Comma-separated property IDs used to filter returned claims (for example: `"P31,P279"`).
- **lang** (str): Preferred language code for labels and formatted values.
- **format** (str): Output format. One of `"json"`, `"text"`, or `"triplet"`.
- **external_ids** (bool): If `true`, include claims with datatype `external-id`.
- **references** (bool): If `true`, include references in claim values (JSON output only).
- **all_ranks** (bool): If `true`, include preferred, normal, and deprecated statement ranks.
- **qualifiers** (bool): If `true`, include qualifiers for claim values.
- **fallback_lang** (str): Fallback language used when `lang` is unavailable.
- **request** (Request): FastAPI request context object.
- **background_tasks** (BackgroundTasks): Background task manager used for cache cleanup.

**Returns:**

A dictionary keyed by requested entity ID (for example, `"Q42"`).
Each value depends on `format`:

- **json**: Structured entity payload with label, description, aliases, and claims.
- **text**: Human-readable summary text.
- **triplet**: Triplet-style text lines with labels and IDs.
"""
try:
filter_pids = []
if pid:
filter_pids = [p.strip() for p in pid.split(',')]
filter_pids = [p.strip() for p in pid.split(",")]

qids = [q.strip() for q in id.split(',')]
qids = [q.strip() for q in id.split(",")]
label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang)

entities = {}
@@ -144,7 +159,9 @@ async def get_textified_wd(
fallback_lang=fallback_lang,
label_factory=label_factory,
debug=False,
) if entity_data.get(qid) else None
)
if entity_data.get(qid)
else None
for qid in qids
}

@@ -154,8 +171,10 @@ async def get_textified_wd(
all_ranks=all_ranks,
references=references,
filter_pids=filter_pids,
qualifiers=qualifiers
) if entity else None
qualifiers=qualifiers,
)
if entity
else None
for qid, entity in entity_data.items()
}

@@ -165,9 +184,9 @@ async def get_textified_wd(
return_data[qid] = None
continue

if format == 'text':
if format == "text":
results = entity.to_text(lang)
elif format == 'triplet':
elif format == "triplet":
results = entity.to_triplet()
else:
results = entity.to_json()
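
The handler splits `id` and `pid` on commas and strips whitespace from each part. As a sketch of a slightly stricter variant (an assumption for illustration, not code from this PR), the hypothetical helper `split_ids` below also drops empty entries, so a trailing comma does not produce an empty ID.

```python
def split_ids(raw):
    """Split a comma-separated ID string, dropping blanks (hypothetical helper)."""
    if not raw:
        # Covers both None (parameter omitted) and the empty string.
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


print(split_ids("Q42, Q2,"))
```

Whether empty entries should be dropped silently or rejected with a 422 response is a design choice this sketch does not settle.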
27 changes: 27 additions & 0 deletions pyproject.toml
@@ -13,3 +13,30 @@ dependencies = [
"sqlalchemy>=2.0.41",
"uvicorn>=0.35.0",
]

[dependency-groups]
dev = [
"ruff>=0.9.0"
]

[tool.ruff]
target-version = "py313"
line-length = 120

exclude = ["data/mysql"]

[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"F", # Pyflakes (catches undefined names, unused imports, etc.)
"I", # isort (import sorting)
"D", # pydocstyle (function/class documentation)
]

[tool.ruff.lint.pydocstyle]
convention = "google"

[tool.ruff.lint.isort]
known-first-party = [
"wikidatasearch"
]
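
With the `D` rules selected and `convention = "google"`, Ruff checks that public modules, classes, and functions carry docstrings written in Google style. A minimal sketch of a function that should satisfy this configuration follows; `join_labels` is a hypothetical example and not code from this repository.

```python
"""Example module docstring (public modules need one under the D rules)."""


def join_labels(labels, separator=", "):
    """Join labels into a single display string.

    Args:
        labels: Iterable of label strings.
        separator: String placed between labels.

    Returns:
        The labels joined by ``separator``.
    """
    return separator.join(labels)


print(join_labels(["human", "writer"]))
```

The docstrings added to `main.py` in this PR follow the same pattern: a one-line summary, a blank line, then the argument and return descriptions.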