-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Labels
Description
Summary
- Context:
GitHubRepositoryIdentityis responsible for generating canonical identifiers for GitHub repositories, including the collection name used in the Qdrant vector store. - Bug: The
canonicalCollectionNamemethod uses a hyphen (-) as a separator between the repository owner and the repository name, but both owner and repository names can themselves contain hyphens, leading to identity collisions. - Actual vs. expected: Two distinct repositories (e.g., owner
a-bwith repocand ownerawith repob-c) produce the exact same collection name (github-a-b-c), whereas they should always map to unique collection names. - Impact: When two repositories collide, they share the same Qdrant collection. Ingestion of one repository will erroneously "prune" (delete) all vectors belonging to the other repository because they are perceived as orphaned, leading to catastrophic data loss for conflicting repositories.
Code with bug
public String canonicalCollectionName() {
return "github-" + encodeCollectionSegment(owner) + "-" + encodeCollectionSegment(repository); // <-- BUG 🔴 Ambiguous separator leads to collisions
}Evidence
- Reproduction Test: A test case creating two distinct
GitHubRepositoryIdentityobjects for(a-b, c)and(a, b-c)confirms that both return the samecanonicalCollectionName():GitHubRepositoryIdentity identity1 = GitHubRepositoryIdentity.of("a-b", "c"); GitHubRepositoryIdentity identity2 = GitHubRepositoryIdentity.of("a", "b-c"); // identity1.canonicalCollectionName() -> "github-a-b-c" // identity2.canonicalCollectionName() -> "github-a-b-c"
- Logic Analysis:
encodeCollectionSegmentreturns the segment unchanged if it matches[a-z0-9-]+. Since botha-bandamatch this pattern, they are returned as-is. Joining them with another hyphen creates an ambiguous string where the boundary between owner and repository is lost. - Downstream Impact: In
GitHubRepoProcessor.purgeDeletedFileOrphans, the code scrolls all URLs in the collection and removes any not present in the current ingestion run. If two repositories share a collection, processing one will delete all data for the other.
Why has this bug gone undetected?
This bug has likely gone undetected because naming collisions between organization-project pairs on GitHub are relatively rare (e.g., one user having org-project/repo and another having org/project-repo). However, in a system designed to support arbitrary GitHub repositories, this poses a significant risk of silent data corruption and accidental deletion.
Recommended fix
Use a separator that is not allowed within the segments themselves. Since encodeCollectionSegment ensures segments only contain [a-z0-9-], the underscore (_) is a safe separator as it is also allowed in Qdrant collection names but guaranteed not to appear inside the encoded segments.
public String canonicalCollectionName() {
return "github-" + encodeCollectionSegment(owner) + "_" + encodeCollectionSegment(repository); // <-- FIX 🟢 Use underscore as unique separator
}Reactions are currently unavailable