# Db sharding #3

Forked from exowanderer/WikidataChat
philippesaade-wmde wants to merge 15 commits into `main` from `db_sharding`.
Changes from all commits (15 commits):
- `51e925b` Major Change: Language and Entity Type Sharding
- `53afc75` Change version to 3.0.0
- `c64e7ac` Clamping and removing negative similarities
- `cc84b19` Refactor ADR for Language and Entity-Type Sharding
- `b1de483` Including docstrings for the unit tests and benchmarks
- `70d1761` Apply suggestion from @itamargiv
- `39a0fd0` Apply suggestion from @itamargiv
- `18b96c3` Apply suggestion from @itamargiv
- `b3398f9` Adding docstrings to API output representations
- `99f7892` Apply suggestion from @exowanderer
- `8137db3` Apply suggestion from @exowanderer
- `8360c92` Update User-Agent header for translator service
- `de116b8` Apply suggestion from @philippesaade-wmde
- `1962ee8` Adding role=tooltip
- `8c4c921` Returning vectors is re-enabled, unit tests are fixed to validate tha…

All commits by philippesaade-wmde.
# WikidataSearch

## Introduction

WikidataSearch is the API and web app for semantic retrieval over the Wikidata Vector Database from the [Wikidata Embedding Project](https://www.wikidata.org/wiki/Wikidata:Embedding_Project).

This repository powers the public service. The intended usage is the hosted API, not running your own deployment.

**Hosted Web App:** [https://wd-vectordb.wmcloud.org/](https://wd-vectordb.wmcloud.org/) \
**Hosted API Docs (OpenAPI):** [https://wd-vectordb.wmcloud.org/docs](https://wd-vectordb.wmcloud.org/docs) \
**Project Page:** [https://www.wikidata.org/wiki/Wikidata:Vector_Database](https://www.wikidata.org/wiki/Wikidata:Vector_Database)
## Hosted API Usage

Base URL:

```text
https://wd-vectordb.wmcloud.org
```

Use a descriptive `User-Agent` header for query endpoints; generic user agents are rejected.

Example header:

```text
User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)
```

Current operational constraints:

- The rate limit is applied per `User-Agent` (default: `30/minute`).
- `return_vectors=true` is currently disabled and returns `422`.
## API Endpoints

### `GET /item/query/`

Semantic and keyword search for Wikidata items (QIDs), fused with Reciprocal Rank Fusion (RRF).

Parameters:

- `query` (required): natural-language query or ID.
- `lang` (default: `all`): vector shard language; unknown languages are translated and then searched globally.
- `K` (default/max: `50`): number of top results requested.
- `instanceof` (optional): comma-separated QIDs used as a `P31` filter.
- `rerank` (default: `false`): apply the reranker to textified Wikidata content.
- `return_vectors` (currently disabled).

Example:

```bash
curl -sG 'https://wd-vectordb.wmcloud.org/item/query/' \
  --data-urlencode 'query=Douglas Adams' \
  --data-urlencode 'lang=en' \
  --data-urlencode 'K=10' \
  -H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
```
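The same request can be prepared from any HTTP client. The sketch below builds the request with the Python standard library only; the helper name and defaults are illustrative, not part of an official SDK:

```python
# Hypothetical client helper for GET /item/query/ -- a sketch, not an
# official SDK. Endpoint path, parameter names, and the User-Agent policy
# come from the docs above.
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://wd-vectordb.wmcloud.org"

def build_item_query(query: str, lang: str = "all", k: int = 10) -> Request:
    """Build a GET request for /item/query/ with a descriptive User-Agent."""
    params = urlencode({"query": query, "lang": lang, "K": k})
    return Request(
        f"{BASE_URL}/item/query/?{params}",
        headers={"User-Agent": "WikidataSearch-Client/1.0 (your-email@example.org)"},
    )

req = build_item_query("Douglas Adams", lang="en", k=10)
print(req.full_url)
# https://wd-vectordb.wmcloud.org/item/query/?query=Douglas+Adams&lang=en&K=10
```

Passing the prepared `Request` to `urllib.request.urlopen` (or switching to `requests`) would issue the call against the hosted service.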
### `GET /property/query/`

Semantic and keyword search for Wikidata properties (PIDs), fused with RRF.

Parameters:

- `query` (required)
- `lang` (default: `all`)
- `K` (default/max: `50`)
- `instanceof` (optional): comma-separated QIDs used as a `P31` filter.
- `exclude_external_ids` (default: `false`): excludes properties with datatype `external-id`.
- `rerank` (default: `false`)
- `return_vectors` (currently disabled)

Example:

```bash
curl -sG 'https://wd-vectordb.wmcloud.org/property/query/' \
  --data-urlencode 'query=instance of' \
  --data-urlencode 'lang=en' \
  --data-urlencode 'exclude_external_ids=true' \
  -H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
```
### `GET /similarity-score/`

Similarity scoring for a fixed list of Wikidata IDs (QIDs and/or PIDs) against one query.

Parameters:

- `query` (required)
- `qid` (required): comma-separated IDs, for example `Q42,Q5,P31`.
- `lang` (default: `all`)
- `return_vectors` (currently disabled)

Example:

```bash
curl -sG 'https://wd-vectordb.wmcloud.org/similarity-score/' \
  --data-urlencode 'query=science fiction writer' \
  --data-urlencode 'qid=Q42,Q25169,P31' \
  -H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
```
## Response Shape

`/item/query/` returns objects with:

- `QID`
- `similarity_score`
- `rrf_score`
- `source` (`Vector Search`, `Keyword Search`, or both)
- `reranker_score` (when `rerank=true`)

`/property/query/` returns the same shape with `PID` instead of `QID`.

`/similarity-score/` returns:

- `QID` or `PID`
- `similarity_score`
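For typed handling of query results, the field list above can be mirrored in a small wrapper. This is a sketch: the class and the sample record are invented for illustration, only the field names come from the documented shape:

```python
# Minimal typed wrapper for /item/query/ result objects -- a sketch based on
# the field list above. The sample record below is invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ItemResult:
    qid: str
    similarity_score: float
    rrf_score: float
    source: str                      # "Vector Search", "Keyword Search", or both
    reranker_score: Optional[float]  # present only when rerank=true

def parse_item(raw: dict) -> ItemResult:
    """Convert one raw JSON object from /item/query/ into an ItemResult."""
    return ItemResult(
        qid=raw["QID"],
        similarity_score=raw["similarity_score"],
        rrf_score=raw["rrf_score"],
        source=raw["source"],
        reranker_score=raw.get("reranker_score"),  # absent unless rerank=true
    )

sample = {"QID": "Q42", "similarity_score": 0.87,
          "rrf_score": 0.032, "source": "Vector Search"}
print(parse_item(sample).qid)  # Q42
```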
## Architecture

High-level request flow:

1. A FastAPI route receives the query and enforces the User-Agent policy and rate limit.
2. `HybridSearch` orchestrates retrieval:
   - Vector path: embeds the query with Jina embeddings and searches Astra DB vector collections across language shards in parallel.
   - Keyword path: runs Wikidata keyword search against `wikidata.org`.
3. Results are fused with Reciprocal Rank Fusion (RRF), preserving source attribution.
4. Optional reranking fetches Wikidata text representations and reorders top hits with the Jina reranker.
5. The JSON response is returned and request metadata is logged for analytics.
Main components in this repo:

- API app and routing: `wikidatasearch/main.py`, `wikidatasearch/routes/`
- Retrieval orchestration: `wikidatasearch/services/search/HybridSearch.py`
- Vector retrieval backend: `wikidatasearch/services/search/VectorSearch.py`
- Keyword retrieval backend: `wikidatasearch/services/search/KeywordSearch.py`
- Embeddings/reranking client: `wikidatasearch/services/jina.py`
## License

WikidataSearch is open-source software licensed under the MIT License; see [LICENSE](LICENSE). We kindly ask for a citation to this repository if you use WikidataSearch in your projects.

## Contact

For questions, comments, or discussions, please open an issue on this GitHub repository. We are committed to fostering a welcoming and collaborative community.
# **Architecture Decision Record**: Language and Entity-Type Sharding for the Wikidata Vector Database

**Status**: Implemented \
**Date**: 12 Mar 2026

## Context

The Wikidata Vector Database initially stored all computed language embeddings for all entity types in a single vector index. As language coverage expanded, this architecture led to rapid index growth, degraded query performance, and reduced retrieval precision. We decided that our operations require a new vector database architecture to scale language support while maintaining search quality.
The Wikidata Vector Database API exposes the following endpoints to query the vector database:

* */item/query/*
* */property/query/*
* */similarity-score/*

The first two */query/* endpoints perform hybrid search using vector search and keyword search, then combine results with Reciprocal Rank Fusion (RRF), optionally followed by reranking with a provided reranker model.

The initial architecture used a single vector database containing all computed vectors of Wikidata entities, where:

* Each entity stored one embedding per language
* All languages were stored in the same vector database
* Items and properties were stored together

As support for additional languages expanded, this architecture introduced several concerns:

* **Database growth:** Each additional language increased the number of vectors stored per entity, resulting in linear growth in database size.
* **Search performance degradation:** Larger vector indexes increase query latency. This was evident when comparing query efficiency between the current database and previous experiments on a subset.
* **Decreased retrieval precision:** Larger vector indexes reduce the effectiveness of approximate nearest neighbour (ANN) search. As the index grows, the probability of missing highly relevant vectors increases due to the limits of ANN approximation.
* **Limited control over language exposure:** Results across languages depended solely on embedding similarity scores, making it difficult to ensure balanced exposure of entities across languages.

These limitations made the single multilingual vector database increasingly slow and difficult to scale as language coverage increased.
## Decision

The vector database architecture will be migrated to a sharded design based on language and entity type. Instead of a single database containing all vectors, the system will use **a separate vector database per language per entity type.**

Entity types include:

* Wikidata **items** (~21 million entities)
* Wikidata **properties** (~12 thousand entities)

Items and properties are stored in separate vector databases because they are queried through different API endpoints and are never retrieved together. Because the number of items is several orders of magnitude larger than the number of properties, properties could be underrepresented in search results if both entity types shared the same vector index.

Additionally, combining items and properties in a single index would require additional filtering during vector search to separate entity types. Separating them simplifies query execution by removing the need for entity-type filtering inside the vector database.

Languages currently supported:

* **English** (~21 million items)
* **French** (~10.6 million items)
* **Arabic** (~3 million items)
* **German** (~9.7 million items)

The new deployment will contain 8 vector databases:

| Entity Type | Language | Database Name | # Vectors |
| :---- | :---- | :---- | ----: |
| Item | English (EN) | `items_en` | 21,127,781 |
| Item | French (FR) | `items_fr` | 10,662,599 |
| Item | Arabic (AR) | `items_ar` | 2,986,814 |
| Item | German (DE) | `items_de` | 9,793,965 |
| Property | English (EN) | `properties_en` | 24,459 |
| Property | French (FR) | `properties_fr` | 21,008 |
| Property | Arabic (AR) | `properties_ar` | 16,529 |
| Property | German (DE) | `properties_de` | 14,174 |

The new architecture is designed as a **general sharding pattern**, allowing more languages to be added without increasing the size of existing vector databases.
## Query Orchestration Strategy

The API server is responsible for orchestrating queries across the vector database shards.

### Vector Search Execution

When a query is received:

1. The query is sent to an embedding service that computes and returns the query embedding vector.
2. The API determines which language shards to query based on the `lang` parameter.
3. Vector searches are executed in parallel across the relevant shards via a second API, which calls the vector database similarity search protocol for each shard.
4. Results from each shard are collected.
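The parallel fan-out in steps 3 and 4 can be sketched as follows. This is illustrative only: `search_shard` is a stand-in for the real vector-database similarity-search call, and the shard names follow the `items_<lang>` convention from the Decision section:

```python
# Sketch of the shard fan-out: run one vector search per language shard in
# parallel and collect the per-shard results. search_shard is a placeholder
# for the real vector-database similarity-search call.
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard: str, embedding: list[float], k: int) -> list[dict]:
    # Placeholder: a real implementation would query the shard's vector index.
    return [{"shard": shard, "id": f"{shard}-hit-0", "score": 0.9}][:k]

def fan_out(shards: list[str], embedding: list[float], k: int = 50) -> dict[str, list[dict]]:
    """Query every shard in parallel; return {shard name: ranked results}."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = {s: pool.submit(search_shard, s, embedding, k) for s in shards}
        return {s: f.result() for s, f in futures.items()}

results = fan_out(["items_en", "items_fr"], embedding=[0.1, 0.2], k=10)
print(sorted(results))  # ['items_en', 'items_fr']
```

Collecting per-shard lists (rather than merging immediately) keeps source attribution available for the fusion step that follows.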
### Language Selection

The `lang` parameter determines which language shards are queried:

* **Specific language:** Only the corresponding language shard is queried.
* **All languages (`all`):** All language shards for the entity type are queried in parallel.
* **Unsupported language:** The query is translated to English, and the system falls back to querying all shards.

**Future consideration**: define a default subset of languages to query, rather than querying all shards. This may become necessary if the number of supported languages increases significantly.
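The three routing rules above can be condensed into one function. A minimal sketch, assuming the currently supported language codes and the `items_<lang>` shard naming from the Decision section; the translation step itself is stubbed out as a boolean flag:

```python
# Sketch of the lang-parameter routing rules. SUPPORTED_LANGS reflects the
# four languages listed in the Decision section; shard names follow the
# items_<lang> convention. Translation is represented only as a flag here.
SUPPORTED_LANGS = ["en", "fr", "ar", "de"]

def shards_for(lang: str, entity_type: str = "items") -> tuple[list[str], bool]:
    """Return (shards to query, whether the query must be translated first)."""
    if lang in SUPPORTED_LANGS:
        return [f"{entity_type}_{lang}"], False          # specific language
    all_shards = [f"{entity_type}_{code}" for code in SUPPORTED_LANGS]
    if lang == "all":
        return all_shards, False                         # all language shards
    return all_shards, True                              # unsupported: translate, then all

print(shards_for("fr"))   # (['items_fr'], False)
print(shards_for("sw"))   # unsupported: translate to English, query all shards
```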
### Query Endpoints

The search endpoints (*/item/query/* and */property/query/*) combine **vector search** and **keyword search** using **Reciprocal Rank Fusion (RRF)**.

Queries are executed against the vector databases corresponding to the requested entity type (items or properties). Within that entity type, vector search is executed independently on each relevant language shard.

* */item/query/* searches the item vector databases
* */property/query/* searches the property vector databases

RRF provides a ranking method that is independent of the raw similarity scores produced by individual retrieval methods or language shards. Entities are ranked by their RRF score, which increases when an entity appears in multiple result lists, such as:

* Results returned from multiple language shards
* Both vector search and keyword search results

Entities that appear frequently and at higher ranks across these result lists receive higher final rankings.
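The fusion step can be sketched in a few lines. Each ranked list contributes `1 / (k + rank)` per entity, so entities found in several lists accumulate a higher score; `k = 60` is the conventional RRF smoothing constant and is an assumption here, not a documented value of this service:

```python
# Sketch of Reciprocal Rank Fusion (RRF) over several ranked result lists,
# e.g. one list per language shard plus the keyword-search list. k=60 is the
# conventional smoothing constant, assumed for illustration.
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each list adds 1/(k + rank) to an entity's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, entity in enumerate(ranking, start=1):
            scores[entity] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Illustrative IDs: Q42 appears in all three lists, so it wins the fusion.
vector_en = ["Q42", "Q5"]
vector_fr = ["Q42", "Q25169"]
keyword   = ["Q5", "Q42"]
fused = rrf_fuse([vector_en, vector_fr, keyword])
print(fused[0][0])  # Q42
```

Because RRF only looks at ranks, shards whose raw similarity scores are on different scales can be fused without any score normalisation.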
### Similarity Score Endpoint

The */similarity-score/* endpoint behaves differently from the search endpoints. Instead of retrieving entities, the user provides a list of entities and requests their similarity scores relative to a given query.

For the requested entity type, the API performs the following steps:

1. Queries the relevant language shards in parallel.
2. Computes similarity scores between the query embedding and the vectors for the provided entities.
3. Selects, for each entity, the highest similarity score across all queried language shards.

This approach ensures a single, deterministic similarity score per entity while accounting for the best available language representation.
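Step 3 reduces to a max over shards per entity. A minimal sketch, with invented shard scores for illustration:

```python
# Sketch of step 3: per entity, keep the highest similarity score across the
# language shards that returned a vector for it. The input scores below are
# invented for illustration.
def best_scores(per_shard: dict[str, dict[str, float]]) -> dict[str, float]:
    """per_shard maps shard name -> {entity ID: similarity score}."""
    best: dict[str, float] = {}
    for scores in per_shard.values():
        for entity, score in scores.items():
            best[entity] = max(best.get(entity, float("-inf")), score)
    return best

per_shard = {
    "items_en": {"Q42": 0.91, "Q25169": 0.40},
    "items_fr": {"Q42": 0.88},  # Q25169 has no vector in this shard
}
print(best_scores(per_shard))  # {'Q42': 0.91, 'Q25169': 0.4}
```

Taking the maximum means an entity is never penalised for having a weak representation in one language, as long as some shard represents it well.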
## Consequences

### Benefits

**Scalable language support:** \
Adding a new language requires adding a new vector database rather than expanding an existing one.

**Improved search precision:** \
Smaller vector indexes reduce nearest neighbour approximation errors and improve retrieval quality.

**Improved query efficiency for single-language searches:** \
Queries targeting a specific language search a smaller index.

**Better control of multilingual exposure:** \
Using RRF to combine shard results ensures that entities from different languages can appear in results instead of relying solely on embedding similarity scores.

**Reduced index size per database:** \
Smaller indexes are easier to maintain and scale operationally.
### Trade-offs

**Increased API complexity:** \
The API server must now coordinate multiple vector searches, parallelize queries, and fuse results across shards.

**Additional development effort:** \
The migration required changes to query orchestration, result fusion logic, and search APIs.
## Operational Considerations

Shard growth occurs independently per language and entity type. Differences in vector counts between languages are expected. Monitoring should therefore focus on system health and query performance, including metrics such as query latency, query failure rates, and API timeout or retry rates.

Shards are logically independent. Failure or degradation of a single language shard should not prevent the API from returning results from other shards.

**Adding a new language requires:**

1. Creating a new item vector database for the added language.
2. Creating a new property vector database for the added language.
3. Adding the appropriate language-specific configuration in [WikidataTextifier](https://github.com/philippesaade-wmde/WikidataTextifier/blob/main/src/Textifier/language_variables.json).
4. Generating embeddings for all entities in the new language vector database.
5. Generating embeddings for properties, including embeddings that incorporate example usage.
6. Updating the API configuration to include the new shards.

Because queries may fan out across multiple shards, system capacity should account for the increased parallel query load as additional languages are introduced.
## Alternatives Considered

### Single multilingual vector database

The previous architecture stored all vectors in a single database. This approach was rejected due to poor scalability as language coverage increased.

### Entity-type split only

Another option was to separate items and properties but keep all languages in a single database. This was rejected because language growth would still increase index size and degrade approximate nearest neighbour search performance.