feat: add C language support via tree-sitter-c #433

Open
dj0nes wants to merge 4 commits into vitali87:main from dj0nes:feat/c-language-support

@dj0nes dj0nes commented Mar 8, 2026

Closes #128

Summary

  • Adds SupportedLanguage.C and the full plumbing needed to index .c files with tree-sitter-c
  • _c_get_name() correctly unwraps pointer_declarator chains so pointer-return functions (e.g. br_pixelmap *BrPixelmapAllocate(...)) are named correctly, unlike a naive port of the C++ approach
  • C_FQN_SPEC scopes to translation_unit, struct_specifier, union_specifier, enum_specifier (no namespaces in C)
  • Queries: function_definition only (dropped declaration which is too broad and catches variable/typedef declarations), struct/union/enum for classes, call_expression for calls, preproc_include for imports
  • LanguageMetadata status set to DEV — solid for standard C, known limitation with unexpanded macros (see below)
  • test_language_node_coverage.py and test_handler_registry.py updated to cover C
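
The pointer_declarator unwrapping described in the summary can be illustrated with a minimal sketch. It does not use tree-sitter itself: `FakeNode` is a hypothetical stand-in that mimics only `type`, `text`, and `child_by_field_name`, and `unwrap_declarator` / `function_name` are illustrative names rather than the PR's actual helpers. The node and field names are assumed to follow the tree-sitter C grammar.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeNode:
    # Hypothetical stand-in for a tree-sitter node; not the real API.
    type: str
    text: bytes = b""
    fields: dict[str, "FakeNode"] = field(default_factory=dict)

    def child_by_field_name(self, name: str) -> "FakeNode | None":
        return self.fields.get(name)


def unwrap_declarator(declarator: FakeNode | None) -> FakeNode | None:
    # Each `*` in the return type wraps the real declarator one level deeper.
    while declarator and declarator.type == "pointer_declarator":
        declarator = declarator.child_by_field_name("declarator")
    return declarator


def function_name(definition: FakeNode) -> str | None:
    declarator = unwrap_declarator(definition.child_by_field_name("declarator"))
    if declarator and declarator.type == "function_declarator":
        name = declarator.child_by_field_name("declarator")
        if name and name.type == "identifier":
            return name.text.decode("utf-8")
    return None


# Assumed shape of `br_pixelmap *BrPixelmapAllocate(...)`:
# function_definition -> pointer_declarator -> function_declarator -> identifier
ident = FakeNode("identifier", b"BrPixelmapAllocate")
func_decl = FakeNode("function_declarator", fields={"declarator": ident})
ptr_decl = FakeNode("pointer_declarator", fields={"declarator": func_decl})
definition = FakeNode("function_definition", fields={"declarator": ptr_decl})

print(function_name(definition))
```

Without the unwrapping loop, the lookup would stop at the pointer_declarator and never reach the identifier, which is why a direct port of a non-pointer-aware lookup misses such functions.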

Known limitation: calling-convention macros

Files that use macros between the return type and function name (Watcom C, Windows WINAPI, etc.) produce ERROR nodes in tree-sitter because the grammar cannot expand macros. Functions declared this way will not be indexed. A preprocessor pass (e.g. cpp -P or pcpp) before parsing would fix this and is worth a follow-up.
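
As a lighter-weight alternative to the full preprocessor pass suggested above, and only where the macro names are known in advance, the macros could be blanked out before the source reaches the parser. The sketch below is an assumption, not part of this PR: `blank_macros` and the `CALLING_CONVENTION_MACROS` list are hypothetical, and the approach does not handle macros that take arguments or macro names appearing inside strings and comments.

```python
import re

# Assumed list for illustration; a real project would configure its own names.
CALLING_CONVENTION_MACROS = (b"BR_RESIDENT_ENTRY", b"WINAPI")

_MACRO_RE = re.compile(
    rb"\b(?:" + b"|".join(map(re.escape, CALLING_CONVENTION_MACROS)) + rb")\b"
)


def blank_macros(source: bytes) -> bytes:
    # Overwrite each macro with spaces of equal length so byte offsets
    # (and therefore reported line/column positions) stay unchanged.
    return _MACRO_RE.sub(lambda m: b" " * len(m.group()), source)


source = b"br_pixelmap * BR_RESIDENT_ENTRY BrPixelmapAllocate(void)\n{\n    return 0;\n}\n"
cleaned = blank_macros(source)
print(cleaned.decode())
```

Padding with spaces rather than deleting the macro keeps node positions aligned with the original file, which matters if the index stores source ranges.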

.h file handling

Headers are still routed to the C++ parser by default. A per-directory heuristic (C-only dirs use C parser, mixed dirs use C++ parser) was prototyped in the original issue #128 patch and can be added as a follow-up without changing the core design here.
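
The per-directory heuristic is not part of this PR, but a minimal sketch of the idea could look like the following. The function name `header_parser_for`, the returned labels, and the extension sets are all assumptions made for illustration; the project's own constants and routing code may differ.

```python
from pathlib import Path

# Assumed extension sets for this sketch only.
C_SOURCE_SUFFIXES = {".c"}
CPP_SOURCE_SUFFIXES = {".cpp", ".cc", ".cxx"}


def header_parser_for(header: Path) -> str:
    # Inspect only the sibling files in the header's own directory.
    suffixes = {p.suffix for p in header.parent.iterdir() if p.is_file()}
    has_c = bool(suffixes & C_SOURCE_SUFFIXES)
    has_cpp = bool(suffixes & CPP_SOURCE_SUFFIXES)
    # C-only directory -> C parser. Mixed, C++-only, or source-free
    # directories keep the current default of parsing headers as C++.
    return "c" if has_c and not has_cpp else "cpp"
```

Falling back to C++ whenever the directory is ambiguous preserves today's behavior for existing C++ projects.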

Test plan

  • test_language_node_coverage.py -- C in extension mapping, language spec params
  • test_handler_registry.py -- C returns BaseLanguageHandler, has all protocol methods
  • test_fqn_resolver.py -- no regressions
  • test_cpp_basic_syntax.py / test_cpp_comprehensive.py -- no regressions
  • Manual: parsed BRender-v1.3.2/core/fw/token.c and extracted 8 functions correctly; pointer-return functions like br_pixelmap *BrPixelmapAllocate() resolve to the correct name

dj0nes added 3 commits March 7, 2026 01:01
pydantic-ai uses function docstrings as the tool description field.
Without a docstring, it sends null, which LM Studio's OpenAI-compatible
API rejects with: tools.N.type: invalid_string.

Add docstrings to semantic_search_functions and get_function_source_by_id
so both tools have a valid description string.

Use the existing td.SEMANTIC_SEARCH and td.GET_FUNCTION_SOURCE constants
as explicit description= arguments to the Tool() constructor, consistent
with every other tool factory in the codebase. Remove docstrings added
in the previous commit, which violated the project no-docstrings rule.

This ensures LM Studio and other strict OpenAI-compatible backends
receive a valid non-null description field in the tool schema.

Adds basic C language parsing support, resolving issue vitali87#128.

- Add SupportedLanguage.C and TreeSitterModule.C to constants
- Add C_EXTENSIONS (.c only), node type tuples, and LANGUAGE_METADATA
- Add _c_get_name() in language_spec.py that correctly unwraps
  pointer_declarator chains for pointer-return functions like
  `br_pixelmap *BrPixelmapAllocate(...)`
- Add C_FQN_SPEC and LANGUAGE_SPECS[C] with function/class/call queries
- Add tree-sitter-c LanguageImport in parser_loader.py
- Add tree-sitter-c>=0.24.1 dependency
- Update test_language_node_coverage.py and test_handler_registry.py
  to cover the C language

Note: .h files remain parsed as C++ by default. Files using calling
convention macros between return type and function name (e.g. Watcom C's
BR_RESIDENT_ENTRY) will produce ERROR nodes and won't be indexed — this
is a known tree-sitter limitation with unexpanded macros.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the codebase by integrating full C language support. It introduces the necessary infrastructure to parse, index, and analyze C source files using tree-sitter-c, including specialized naming conventions for C functions and tailored language specifications. This expansion allows the system to process C code effectively, laying the groundwork for broader language compatibility while acknowledging and documenting current limitations.

Highlights

  • C Language Support Added: Introduced comprehensive C language support, including file extension recognition, a new SupportedLanguage.C enum, and integration with tree-sitter-c.
  • Accurate C Function Naming: Implemented specialized logic (_c_get_name) to accurately resolve function names in C, particularly for functions returning pointers, by correctly unwrapping pointer_declarator chains.
  • C-specific Language Specifications: Defined C-specific FQN (Fully Qualified Name) and Language Specification constants, tailoring scope, function, class, call, and import node types for C's syntax, including translation_unit, struct_specifier, union_specifier, and enum_specifier.
  • Language Metadata and Limitations: Updated the language metadata to reflect C's DEV status, acknowledging current limitations with unexpanded macros and providing a path for future improvements.
  • Expanded Test Coverage: Expanded test coverage to include C language handler functionality and node coverage, ensuring proper integration and behavior for the new language.
  • Dependency Update: Added tree-sitter-c as a project dependency to enable the new C parsing capabilities.
Changelog
  • codebase_rag/constants.py
    • Added .c to EXT_C and C_EXTENSIONS.
    • Added SupportedLanguage.C enum member.
    • Defined LanguageMetadata for C with DEV status.
    • Added TreeSitterModule.C for tree_sitter_c.
    • Introduced FQN_C_SCOPE_TYPES, FQN_C_FUNCTION_TYPES, SPEC_C_FUNCTION_TYPES, SPEC_C_CLASS_TYPES, SPEC_C_MODULE_TYPES, SPEC_C_CALL_TYPES, and SPEC_C_PACKAGE_INDICATORS.
  • codebase_rag/language_spec.py
    • Added _c_unwrap_declarator to handle pointer declarators.
    • Implemented _c_get_name for C-specific name resolution, especially for function definitions.
    • Defined C_FQN_SPEC using C-specific constants and _c_get_name.
    • Registered C_FQN_SPEC and LanguageSpec for SupportedLanguage.C.
  • codebase_rag/parser_loader.py
    • Added LanguageImport entry for SupportedLanguage.C and TreeSitterModule.C.
  • codebase_rag/tests/test_handler_registry.py
    • Added test_returns_base_handler_for_c to verify C handler instantiation.
    • Included SupportedLanguage.C in test_handler_has_all_protocol_methods and test_handler_methods_are_callable parameterizations.
  • codebase_rag/tests/test_language_node_coverage.py
    • Imported C_EXTENSIONS.
    • Added SupportedLanguage.C and C_EXTENSIONS to test_each_language_has_file_extensions.
    • Added (".c", SupportedLanguage.C) to test_language_spec_has_correct_extensions.
  • codebase_rag/tools/semantic_search.py
    • Updated Tool instantiations to include explicit description arguments for semantic_search_functions and get_function_source_by_id.
  • pyproject.toml
    • Added tree-sitter-c>=0.24.1 to dependencies.
    • Added tree-sitter-c>=0.21.0 to treesitter-full extra dependencies.
  • uv.lock
    • Updated code-graph-rag version to 0.0.116.
    • Added tree-sitter-c as a dependency and to treesitter-full extra.
    • Added tree-sitter-c package details with various wheel URLs and hashes.
Activity
  • No human activity (comments, reviews, approvals) was provided in the context for this pull request.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the C programming language, a significant and well-implemented feature. The changes are comprehensive, covering constants, language specifications, parser loading, and adding corresponding tests. The approach to handle C-specific syntax, like pointer-return functions, is thoughtful.

My review includes a few minor suggestions to improve maintainability and consistency:

  • Aligning version specifiers for the new tree-sitter-c dependency in pyproject.toml.
  • Removing docstrings to adhere to project conventions.
  • A suggestion to improve clarity in the constants file by aliasing reused C++ node types for the C specification.

Comment on lines +2759 to +2764
FQN_C_SCOPE_TYPES = (
    TS_CPP_TRANSLATION_UNIT,
    TS_STRUCT_SPECIFIER,
    TS_UNION_SPECIFIER,
    TS_ENUM_SPECIFIER,
)
Contributor

medium

For better clarity and maintainability, consider aliasing the reused C++ tree-sitter node types to C-specific names. For example: TS_C_TRANSLATION_UNIT = TS_CPP_TRANSLATION_UNIT. This would make the C configuration more self-documenting and less prone to confusion with the C++ spec.

Author

Added a dedicated C_NAME_NODE_TYPES constant in constants.py instead of aliasing. It includes TS_STRUCT_SPECIFIER, TS_UNION_SPECIFIER, and TS_ENUM_SPECIFIER, and _c_get_name now references it. Commit 74e10f2.



def _c_unwrap_declarator(declarator: Node) -> Node | None:
    """Unwrap pointer_declarator chains to find the inner function_declarator."""
Contributor

medium

Per the project's general rules, docstrings are not allowed. Please remove this docstring.

References
  1. Docstrings are not allowed in this project, as enforced by a pre-commit hook.

Author

Docstring removed from _c_unwrap_declarator. Commit 74e10f2.



def _c_get_name(node: Node) -> str | None:
    """Get name for C entities, handling pointer-return functions."""
Contributor

medium

Per the project's general rules, docstrings are not allowed. Please remove this docstring.

References
  1. Docstrings are not allowed in this project, as enforced by a pre-commit hook.

Author

Docstring removed from _c_get_name. Commit 74e10f2.

pyproject.toml Outdated
"tree-sitter-go>=0.23.4",
"tree-sitter-scala>=0.24.0",
"tree-sitter-java>=0.23.5",
"tree-sitter-c>=0.21.0",
Contributor

medium

The version specifier for tree-sitter-c here (>=0.21.0) is inconsistent with the one in the main dependencies list (>=0.24.1). To avoid confusion and potential dependency issues, it's best to use the same, more restrictive version in both places for consistency.

Suggested change
-"tree-sitter-c>=0.21.0",
+"tree-sitter-c>=0.24.1",

Author

Removed tree-sitter-c from core dependencies entirely and updated the treesitter-full extra entry to >=0.24.1, consistent with the rest of the tree-sitter grammars pattern. Commit 74e10f2.

@greptile-apps
Contributor

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR wires up full C language support through tree-sitter-c, following the same plumbing pattern used for every other language in the project (SupportedLanguage, TreeSitterModule, LanguageSpec, FQNSpec, LanguageImport). The notable addition is _c_unwrap_declarator, which walks pointer_declarator chains so that pointer-return functions (e.g. br_pixelmap *BrPixelmapAllocate(...)) are named correctly instead of being silently dropped.

Key issues found:

  • Type annotation bug (language_spec.py): _c_unwrap_declarator declares its parameter as Node, but the call site passes the result of child_by_field_name, which is Node | None. The runtime while declarator and ... guard is safe, but ty check will flag the signature. The parameter type should be Node | None.
  • union_specifier gap (language_spec.py): _c_get_name dispatches on CPP_NAME_NODE_TYPES, a C++-specific constant that intentionally omits TS_UNION_SPECIFIER. Union nodes therefore silently fall through to _generic_get_name. This works in practice but is fragile; a dedicated C_NAME_NODE_TYPES constant including TS_UNION_SPECIFIER should be defined and used instead.
  • Docstrings & comment policy (language_spec.py): _c_unwrap_declarator and _c_get_name have docstrings, and line 116 has a trailing inline comment without an (H) prefix — all three violate the project's "no comments or docstrings" coding standard.
  • pyproject.toml version inconsistency: tree-sitter-c is listed in both the core dependencies (>=0.24.1) and the treesitter-full extra (>=0.21.0). Every other tree-sitter grammar is extras-only; the duplicate looser bound in treesitter-full is dead code and conflicts with the project's packaging pattern.

Confidence Score: 3/5

  • Safe to merge after fixing the type annotation, union_specifier gap, and pyproject.toml version inconsistency; the style violations are low-risk but should be resolved to pass pre-commit hooks.
  • The core architecture is sound and follows established patterns. However, the type annotation mismatch will cause ty check failures, the CPP_NAME_NODE_TYPES reuse silently mishandles union names, and the duplicate tree-sitter-c entries in pyproject.toml deviate from the project's dependency convention. None of these are runtime crashes, but they collectively reduce confidence below a passing bar.
  • codebase_rag/language_spec.py (type annotation, union coverage, comment policy) and pyproject.toml (duplicate/inconsistent version bounds).

Important Files Changed

Filename: Overview
  • codebase_rag/language_spec.py: Adds C-specific FQN spec and LanguageSpec; introduces _c_unwrap_declarator / _c_get_name helpers with a type annotation bug (Node instead of `Node
  • codebase_rag/constants.py: Adds EXT_C, C_EXTENSIONS, SupportedLanguage.C, TreeSitterModule.C, LANGUAGE_METADATA entry, and all C-specific node-type tuples; changes look correct and consistent with the existing pattern.
  • pyproject.toml: Adds tree-sitter-c as a core dependency (>=0.24.1) AND again in the treesitter-full extra (>=0.21.0); deviates from the pattern of all other tree-sitter grammars, which are extras-only, and the conflicting version bounds are confusing.
  • codebase_rag/parser_loader.py: Adds LanguageImport for C using TreeSitterModule.C; follows the exact same pattern as every other language loader entry.
  • codebase_rag/tools/semantic_search.py: Adds description= keyword to two Tool instantiations; unrelated to C language support but a harmless improvement.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[C source file .c] --> B{Extension lookup}
    B --> C[SupportedLanguage.C]
    C --> D[LanguageSpec for C]
    D --> E[tree-sitter-c parser\nvia TreeSitterModule.C]
    E --> F[Parse AST]
    F --> G{Node type?}
    G -->|function_definition| H[_c_get_name]
    G -->|struct/union/enum_specifier| I[C_FQN_SPEC / _c_get_name]
    G -->|call_expression| J[call_node indexing]
    G -->|preproc_include| K[import indexing]
    H --> L{Has pointer_declarator?}
    L -->|Yes| M[_c_unwrap_declarator\nwalk chain to function_declarator]
    L -->|No| N[Read declarator.declarator directly]
    M --> O[Extract identifier name]
    N --> O
    I --> P{node.type in CPP_NAME_NODE_TYPES?}
    P -->|struct_specifier / enum_specifier| Q[child_by_field_name name]
    P -->|union_specifier - NOT in tuple| R[_generic_get_name fallback]
    Q --> S[FQN resolved]
    R --> S
    O --> S

Last reviewed commit: 4a84211

Comment on lines +100 to +120
def _c_unwrap_declarator(declarator: Node) -> Node | None:
    """Unwrap pointer_declarator chains to find the inner function_declarator."""
    while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR:
        declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
    return declarator


def _c_get_name(node: Node) -> str | None:
    """Get name for C entities, handling pointer-return functions."""
    if node.type in cs.CPP_NAME_NODE_TYPES:
        name_node = node.child_by_field_name(cs.FIELD_NAME)
        if name_node and name_node.text:
            return name_node.text.decode(cs.ENCODING_UTF8)
    elif node.type == cs.TS_CPP_FUNCTION_DEFINITION:
        declarator = node.child_by_field_name(cs.FIELD_DECLARATOR)
        declarator = _c_unwrap_declarator(declarator)
        if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR: # "function_declarator"
            name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
            if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text:
                return name_node.text.decode(cs.ENCODING_UTF8)
    return _generic_get_name(node)
Contributor

Docstrings and inline comment violate coding standards

The project's coding standard explicitly forbids docstrings and inline comments unless they carry the (H) marker. Three violations appear in this block:

  1. The docstring on _c_unwrap_declarator (line 101)
  2. The docstring on _c_get_name (line 108)
  3. The trailing comment # "function_declarator" on line 116

All three should be removed. The function names and code should be self-documenting per the project's style guide.

Suggested change
-def _c_unwrap_declarator(declarator: Node) -> Node | None:
-    """Unwrap pointer_declarator chains to find the inner function_declarator."""
+def _c_unwrap_declarator(declarator: Node | None) -> Node | None:
     while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR:
         declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
     return declarator


 def _c_get_name(node: Node) -> str | None:
-    """Get name for C entities, handling pointer-return functions."""
     if node.type in cs.CPP_NAME_NODE_TYPES:
         name_node = node.child_by_field_name(cs.FIELD_NAME)
         if name_node and name_node.text:
             return name_node.text.decode(cs.ENCODING_UTF8)
     elif node.type == cs.TS_CPP_FUNCTION_DEFINITION:
         declarator = node.child_by_field_name(cs.FIELD_DECLARATOR)
         declarator = _c_unwrap_declarator(declarator)
-        if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR: # "function_declarator"
+        if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR:
             name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
             if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text:
                 return name_node.text.decode(cs.ENCODING_UTF8)
     return _generic_get_name(node)

Author

All three removed: docstrings on both functions and the trailing inline comment on the function_declarator check. Commit 74e10f2.

Comment on lines +100 to +104
def _c_unwrap_declarator(declarator: Node) -> Node | None:
    """Unwrap pointer_declarator chains to find the inner function_declarator."""
    while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR:
        declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
    return declarator
Contributor

Type annotation mismatch: declarator parameter should accept None

child_by_field_name returns Node | None, so declarator in _c_get_name (line 114) is typed Node | None before being passed to _c_unwrap_declarator. The parameter annotation declarator: Node is therefore incorrect — ty check will flag this. The runtime while declarator and ... guard already handles None, so only the annotation needs fixing.

Suggested change
-def _c_unwrap_declarator(declarator: Node) -> Node | None:
-    """Unwrap pointer_declarator chains to find the inner function_declarator."""
+def _c_unwrap_declarator(declarator: Node | None) -> Node | None:
     while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR:
         declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
     return declarator

Author

Fixed: parameter type is now Node | None. Commit 74e10f2.

Comment on lines +107 to +120
def _c_get_name(node: Node) -> str | None:
    """Get name for C entities, handling pointer-return functions."""
    if node.type in cs.CPP_NAME_NODE_TYPES:
        name_node = node.child_by_field_name(cs.FIELD_NAME)
        if name_node and name_node.text:
            return name_node.text.decode(cs.ENCODING_UTF8)
    elif node.type == cs.TS_CPP_FUNCTION_DEFINITION:
        declarator = node.child_by_field_name(cs.FIELD_DECLARATOR)
        declarator = _c_unwrap_declarator(declarator)
        if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR: # "function_declarator"
            name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
            if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text:
                return name_node.text.decode(cs.ENCODING_UTF8)
    return _generic_get_name(node)
Contributor

union_specifier names silently fall through to _generic_get_name

CPP_NAME_NODE_TYPES is defined in constants.py (line 2656–2660) as:

CPP_NAME_NODE_TYPES = (
    CppNodeType.CLASS_SPECIFIER,
    TS_STRUCT_SPECIFIER,
    TS_ENUM_SPECIFIER,
)

TS_UNION_SPECIFIER is not in this tuple. Because _c_get_name delegates to _generic_get_name for any node that is neither in CPP_NAME_NODE_TYPES nor a function_definition, union nodes take the generic path instead of the explicit struct/enum path. While _generic_get_name likely resolves the name field correctly in practice, it is fragile and confusing: _c_get_name is using a constant explicitly named for C++ that intentionally omits unions.

A dedicated C_NAME_NODE_TYPES constant should be defined in constants.py that includes TS_UNION_SPECIFIER:

# in constants.py
C_NAME_NODE_TYPES = (
    TS_STRUCT_SPECIFIER,
    TS_UNION_SPECIFIER,
    TS_ENUM_SPECIFIER,
)

and _c_get_name should reference cs.C_NAME_NODE_TYPES instead of cs.CPP_NAME_NODE_TYPES.

Author

Added C_NAME_NODE_TYPES to constants.py with TS_STRUCT_SPECIFIER, TS_UNION_SPECIFIER, and TS_ENUM_SPECIFIER. _c_get_name now uses cs.C_NAME_NODE_TYPES so union nodes are handled explicitly. Commit 74e10f2.
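
The difference between the two tuples discussed in this thread can be checked with a small sketch. The string values are assumptions: they are taken to be the tree-sitter node type names that the project's `TS_*` constants resolve to, and `name_path` is a hypothetical helper that only reports which branch a node type would take.

```python
# Assumed node type strings behind the constants discussed above.
CPP_NAME_NODE_TYPES = ("class_specifier", "struct_specifier", "enum_specifier")
C_NAME_NODE_TYPES = ("struct_specifier", "union_specifier", "enum_specifier")


def name_path(node_type: str, name_types: tuple[str, ...]) -> str:
    # "explicit": the name is read directly from the node's `name` field.
    # "generic": the node falls through to the generic name lookup.
    return "explicit" if node_type in name_types else "generic"


for node_type in C_NAME_NODE_TYPES:
    before = name_path(node_type, CPP_NAME_NODE_TYPES)
    after = name_path(node_type, C_NAME_NODE_TYPES)
    print(f"{node_type}: {before} -> {after}")
```

Only union_specifier changes path, which matches the reviewer's observation that structs and enums were already handled explicitly.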

pyproject.toml Outdated
"protobuf>=5.27.0",
"defusedxml>=0.7.1",
"huggingface-hub[hf-xet]>=0.36.0",
"tree-sitter-c>=0.24.1",
Contributor

tree-sitter-c version bounds are inconsistent between core and extras

The package appears in two places with different minimum versions:

  • Core dependencies (line 55): tree-sitter-c>=0.24.1
  • treesitter-full extra (line 85): tree-sitter-c>=0.21.0

Every other language parser (tree-sitter-cpp, tree-sitter-java, etc.) lives only in the treesitter-full extra. Adding tree-sitter-c as a core dependency forces it to be installed for all users regardless of which extras they choose. The looser >=0.21.0 bound in the extra is also dead code since the core pinning already constrains resolution to >=0.24.1.

Either remove the duplicate from treesitter-full and leave only the core entry, or — following the existing pattern — move tree-sitter-c entirely to treesitter-full with a single consistent version bound.

Suggested change
"tree-sitter-c>=0.24.1",
"tree-sitter-c>=0.24.1",

(and remove the >=0.21.0 entry from treesitter-full)

Author

Removed tree-sitter-c from core dependencies and updated the treesitter-full extra to a single >=0.24.1 entry, matching the pattern of every other tree-sitter grammar. Commit 74e10f2.

Development

Successfully merging this pull request may close these issues.

C language support?

1 participant