Skip to content

[Detail Bug] Hyphen separator in canonicalCollectionName causes repository identity collisions #32

@detail-app

Description

@detail-app

Summary

  • Context: GitHubRepositoryIdentity is responsible for generating canonical identifiers for GitHub repositories, including the collection name used in the Qdrant vector store.
  • Bug: The canonicalCollectionName method uses a hyphen (-) as a separator between the repository owner and the repository name, but both owner and repository names can themselves contain hyphens, leading to identity collisions.
  • Actual vs. expected: Two distinct repositories (e.g., owner a-b with repo c and owner a with repo b-c) produce the exact same collection name (github-a-b-c), whereas they should always map to unique collection names.
  • Impact: When two repositories collide, they share the same Qdrant collection. Ingestion of one repository will erroneously "prune" (delete) all vectors belonging to the other repository because they are perceived as orphaned, leading to catastrophic data loss for conflicting repositories.

Code with bug

public String canonicalCollectionName() {
    return "github-" + encodeCollectionSegment(owner) + "-" + encodeCollectionSegment(repository); // <-- BUG 🔴 Ambiguous separator leads to collisions
}

Evidence

  1. Reproduction Test: A test case creating two distinct GitHubRepositoryIdentity objects for (a-b, c) and (a, b-c) confirms that both return the same canonicalCollectionName():
    GitHubRepositoryIdentity identity1 = GitHubRepositoryIdentity.of("a-b", "c");
    GitHubRepositoryIdentity identity2 = GitHubRepositoryIdentity.of("a", "b-c");
    // identity1.canonicalCollectionName() -> "github-a-b-c"
    // identity2.canonicalCollectionName() -> "github-a-b-c"
  2. Logic Analysis: encodeCollectionSegment returns the segment unchanged if it matches [a-z0-9-]+. Since both a-b and a match this pattern, they are returned as-is. Joining them with another hyphen creates an ambiguous string where the boundary between owner and repository is lost.
  3. Downstream Impact: In GitHubRepoProcessor.purgeDeletedFileOrphans, the code scrolls all URLs in the collection and removes any not present in the current ingestion run. If two repositories share a collection, processing one will delete all data for the other.

Why has this bug gone undetected?

This bug has likely gone undetected because naming collisions between organization-project pairs on GitHub are relatively rare (e.g., one user having org-project/repo and another having org/project-repo). However, in a system designed to support arbitrary GitHub repositories, this poses a significant risk of silent data corruption and accidental deletion.

Recommended fix

Use a separator that is not allowed within the segments themselves. Since encodeCollectionSegment ensures segments only contain [a-z0-9-], the underscore (_) is a safe separator as it is also allowed in Qdrant collection names but guaranteed not to appear inside the encoded segments.

public String canonicalCollectionName() {
    return "github-" + encodeCollectionSegment(owner) + "_" + encodeCollectionSegment(repository); // <-- FIX 🟢 Use underscore as unique separator
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingdetail

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions