feat: add C language support via tree-sitter-c#433
feat: add C language support via tree-sitter-c#433dj0nes wants to merge 4 commits intovitali87:mainfrom
Conversation
pydantic-ai uses function docstrings as the tool description field. Without a docstring, it sends null, which LM Studio's OpenAI-compatible API rejects with: tools.N.type: invalid_string. Add docstrings to semantic_search_functions and get_function_source_by_id so both tools have a valid description string.
Use the existing td.SEMANTIC_SEARCH and td.GET_FUNCTION_SOURCE constants as explicit description= arguments to the Tool() constructor, consistent with every other tool factory in the codebase. Remove docstrings added in the previous commit, which violated the project no-docstrings rule. This ensures LM Studio and other strict OpenAI-compatible backends receive a valid non-null description field in the tool schema.
Adds basic C language parsing support, resolving issue vitali87#128. - Add SupportedLanguage.C and TreeSitterModule.C to constants - Add C_EXTENSIONS (.c only), node type tuples, and LANGUAGE_METADATA - Add _c_get_name() in language_spec.py that correctly unwraps pointer_declarator chains for pointer-return functions like `br_pixelmap *BrPixelmapAllocate(...)` - Add C_FQN_SPEC and LANGUAGE_SPECS[C] with function/class/call queries - Add tree-sitter-c LanguageImport in parser_loader.py - Add tree-sitter-c>=0.24.1 dependency - Update test_language_node_coverage.py and test_handler_registry.py to cover the C language Note: .h files remain parsed as C++ by default. Files using calling convention macros between return type and function name (e.g. Watcom C's BR_RESIDENT_ENTRY) will produce ERROR nodes and won't be indexed — this is a known tree-sitter limitation with unexpanded macros.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the codebase by integrating full C language support. It introduces the necessary infrastructure to parse, index, and analyze C source files using Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for the C programming language, a significant and well-implemented feature. The changes are comprehensive, covering constants, language specifications, parser loading, and adding corresponding tests. The approach to handle C-specific syntax, like pointer-return functions, is thoughtful.
My review includes a few minor suggestions to improve maintainability and consistency:
- Aligning version specifiers for the new
tree-sitter-cdependency inpyproject.toml. - Removing docstrings to adhere to project conventions.
- A suggestion to improve clarity in the constants file by aliasing reused C++ node types for the C specification.
| FQN_C_SCOPE_TYPES = ( | ||
| TS_CPP_TRANSLATION_UNIT, | ||
| TS_STRUCT_SPECIFIER, | ||
| TS_UNION_SPECIFIER, | ||
| TS_ENUM_SPECIFIER, | ||
| ) |
There was a problem hiding this comment.
There was a problem hiding this comment.
Added a dedicated C_NAME_NODE_TYPES constant in constants.py instead of aliasing. It includes TS_STRUCT_SPECIFIER, TS_UNION_SPECIFIER, and TS_ENUM_SPECIFIER, and _c_get_name now references it. Commit 74e10f2.
codebase_rag/language_spec.py
Outdated
|
|
||
|
|
||
| def _c_unwrap_declarator(declarator: Node) -> Node | None: | ||
| """Unwrap pointer_declarator chains to find the inner function_declarator.""" |
There was a problem hiding this comment.
Docstring removed from _c_unwrap_declarator. Commit 74e10f2.
codebase_rag/language_spec.py
Outdated
|
|
||
|
|
||
| def _c_get_name(node: Node) -> str | None: | ||
| """Get name for C entities, handling pointer-return functions.""" |
pyproject.toml
Outdated
| "tree-sitter-go>=0.23.4", | ||
| "tree-sitter-scala>=0.24.0", | ||
| "tree-sitter-java>=0.23.5", | ||
| "tree-sitter-c>=0.21.0", |
There was a problem hiding this comment.
The version specifier for tree-sitter-c here (>=0.21.0) is inconsistent with the one in the main dependencies list (>=0.24.1). To avoid confusion and potential dependency issues, it's best to use the same, more restrictive version in both places for consistency.
| "tree-sitter-c>=0.21.0", | |
| "tree-sitter-c>=0.24.1", |
There was a problem hiding this comment.
Removed tree-sitter-c from core dependencies entirely and updated the treesitter-full extra entry to >=0.24.1, consistent with the rest of the tree-sitter grammars pattern. Commit 74e10f2.
Greptile SummaryThis PR wires up full C language support through tree-sitter-c, following the same plumbing pattern used for every other language in the project ( Key issues found:
Confidence Score: 3/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[C source file .c] --> B{Extension lookup}
B --> C[SupportedLanguage.C]
C --> D[LanguageSpec for C]
D --> E[tree-sitter-c parser\nvia TreeSitterModule.C]
E --> F[Parse AST]
F --> G{Node type?}
G -->|function_definition| H[_c_get_name]
G -->|struct/union/enum_specifier| I[C_FQN_SPEC / _c_get_name]
G -->|call_expression| J[call_node indexing]
G -->|preproc_include| K[import indexing]
H --> L{Has pointer_declarator?}
L -->|Yes| M[_c_unwrap_declarator\nwalk chain to function_declarator]
L -->|No| N[Read declarator.declarator directly]
M --> O[Extract identifier name]
N --> O
I --> P{node.type in CPP_NAME_NODE_TYPES?}
P -->|struct_specifier / enum_specifier| Q[child_by_field_name name]
P -->|union_specifier - NOT in tuple| R[_generic_get_name fallback]
Q --> S[FQN resolved]
R --> S
O --> S
Last reviewed commit: 4a84211 |
codebase_rag/language_spec.py
Outdated
| def _c_unwrap_declarator(declarator: Node) -> Node | None: | ||
| """Unwrap pointer_declarator chains to find the inner function_declarator.""" | ||
| while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR: | ||
| declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | ||
| return declarator | ||
|
|
||
|
|
||
| def _c_get_name(node: Node) -> str | None: | ||
| """Get name for C entities, handling pointer-return functions.""" | ||
| if node.type in cs.CPP_NAME_NODE_TYPES: | ||
| name_node = node.child_by_field_name(cs.FIELD_NAME) | ||
| if name_node and name_node.text: | ||
| return name_node.text.decode(cs.ENCODING_UTF8) | ||
| elif node.type == cs.TS_CPP_FUNCTION_DEFINITION: | ||
| declarator = node.child_by_field_name(cs.FIELD_DECLARATOR) | ||
| declarator = _c_unwrap_declarator(declarator) | ||
| if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR: # "function_declarator" | ||
| name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | ||
| if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text: | ||
| return name_node.text.decode(cs.ENCODING_UTF8) | ||
| return _generic_get_name(node) |
There was a problem hiding this comment.
Docstrings and inline comment violate coding standards
The project's coding standard explicitly forbids docstrings and inline comments unless they carry the (H) marker. Three violations appear in this block:
- The docstring on
_c_unwrap_declarator(line 101) - The docstring on
_c_get_name(line 108) - The trailing comment
# "function_declarator"on line 116
All three should be removed. The function names and code should be self-documenting per the project's style guide.
| def _c_unwrap_declarator(declarator: Node) -> Node | None: | |
| """Unwrap pointer_declarator chains to find the inner function_declarator.""" | |
| while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR: | |
| declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | |
| return declarator | |
| def _c_get_name(node: Node) -> str | None: | |
| """Get name for C entities, handling pointer-return functions.""" | |
| if node.type in cs.CPP_NAME_NODE_TYPES: | |
| name_node = node.child_by_field_name(cs.FIELD_NAME) | |
| if name_node and name_node.text: | |
| return name_node.text.decode(cs.ENCODING_UTF8) | |
| elif node.type == cs.TS_CPP_FUNCTION_DEFINITION: | |
| declarator = node.child_by_field_name(cs.FIELD_DECLARATOR) | |
| declarator = _c_unwrap_declarator(declarator) | |
| if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR: # "function_declarator" | |
| name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | |
| if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text: | |
| return name_node.text.decode(cs.ENCODING_UTF8) | |
| return _generic_get_name(node) | |
| def _c_unwrap_declarator(declarator: Node | None) -> Node | None: | |
| while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR: | |
| declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | |
| return declarator | |
| def _c_get_name(node: Node) -> str | None: | |
| if node.type in cs.CPP_NAME_NODE_TYPES: | |
| name_node = node.child_by_field_name(cs.FIELD_NAME) | |
| if name_node and name_node.text: | |
| return name_node.text.decode(cs.ENCODING_UTF8) | |
| elif node.type == cs.TS_CPP_FUNCTION_DEFINITION: | |
| declarator = node.child_by_field_name(cs.FIELD_DECLARATOR) | |
| declarator = _c_unwrap_declarator(declarator) | |
| if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR: | |
| name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | |
| if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text: | |
| return name_node.text.decode(cs.ENCODING_UTF8) | |
| return _generic_get_name(node) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/language_spec.py
Line: 100-120
Comment:
**Docstrings and inline comment violate coding standards**
The project's coding standard explicitly forbids docstrings and inline comments unless they carry the `(H)` marker. Three violations appear in this block:
1. The docstring on `_c_unwrap_declarator` (line 101)
2. The docstring on `_c_get_name` (line 108)
3. The trailing comment `# "function_declarator"` on line 116
All three should be removed. The function names and code should be self-documenting per the project's style guide.
```suggestion
def _c_unwrap_declarator(declarator: Node | None) -> Node | None:
while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR:
declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
return declarator
def _c_get_name(node: Node) -> str | None:
if node.type in cs.CPP_NAME_NODE_TYPES:
name_node = node.child_by_field_name(cs.FIELD_NAME)
if name_node and name_node.text:
return name_node.text.decode(cs.ENCODING_UTF8)
elif node.type == cs.TS_CPP_FUNCTION_DEFINITION:
declarator = node.child_by_field_name(cs.FIELD_DECLARATOR)
declarator = _c_unwrap_declarator(declarator)
if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR:
name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text:
return name_node.text.decode(cs.ENCODING_UTF8)
return _generic_get_name(node)
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
All three removed: docstrings on both functions and the trailing inline comment on the function_declarator check. Commit 74e10f2.
codebase_rag/language_spec.py
Outdated
| def _c_unwrap_declarator(declarator: Node) -> Node | None: | ||
| """Unwrap pointer_declarator chains to find the inner function_declarator.""" | ||
| while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR: | ||
| declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | ||
| return declarator |
There was a problem hiding this comment.
Type annotation mismatch: declarator parameter should accept None
child_by_field_name returns Node | None, so declarator in _c_get_name (line 114) is typed Node | None before being passed to _c_unwrap_declarator. The parameter annotation declarator: Node is therefore incorrect — ty check will flag this. The runtime while declarator and ... guard already handles None, so only the annotation needs fixing.
| def _c_unwrap_declarator(declarator: Node) -> Node | None: | |
| """Unwrap pointer_declarator chains to find the inner function_declarator.""" | |
| while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR: | |
| declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | |
| return declarator | |
| def _c_unwrap_declarator(declarator: Node | None) -> Node | None: | |
| while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR: | |
| declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | |
| return declarator |
Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/language_spec.py
Line: 100-104
Comment:
**Type annotation mismatch: `declarator` parameter should accept `None`**
`child_by_field_name` returns `Node | None`, so `declarator` in `_c_get_name` (line 114) is typed `Node | None` before being passed to `_c_unwrap_declarator`. The parameter annotation `declarator: Node` is therefore incorrect — `ty check` will flag this. The runtime `while declarator and ...` guard already handles `None`, so only the annotation needs fixing.
```suggestion
def _c_unwrap_declarator(declarator: Node | None) -> Node | None:
while declarator and declarator.type == cs.CppNodeType.POINTER_DECLARATOR:
declarator = declarator.child_by_field_name(cs.FIELD_DECLARATOR)
return declarator
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed: parameter type is now Node | None. Commit 74e10f2.
| def _c_get_name(node: Node) -> str | None: | ||
| """Get name for C entities, handling pointer-return functions.""" | ||
| if node.type in cs.CPP_NAME_NODE_TYPES: | ||
| name_node = node.child_by_field_name(cs.FIELD_NAME) | ||
| if name_node and name_node.text: | ||
| return name_node.text.decode(cs.ENCODING_UTF8) | ||
| elif node.type == cs.TS_CPP_FUNCTION_DEFINITION: | ||
| declarator = node.child_by_field_name(cs.FIELD_DECLARATOR) | ||
| declarator = _c_unwrap_declarator(declarator) | ||
| if declarator and declarator.type == cs.TS_CPP_FUNCTION_DECLARATOR: # "function_declarator" | ||
| name_node = declarator.child_by_field_name(cs.FIELD_DECLARATOR) | ||
| if name_node and name_node.type == cs.TS_IDENTIFIER and name_node.text: | ||
| return name_node.text.decode(cs.ENCODING_UTF8) | ||
| return _generic_get_name(node) |
There was a problem hiding this comment.
union_specifier names silently fall through to _generic_get_name
CPP_NAME_NODE_TYPES is defined in constants.py (line 2656–2660) as:
CPP_NAME_NODE_TYPES = (
CppNodeType.CLASS_SPECIFIER,
TS_STRUCT_SPECIFIER,
TS_ENUM_SPECIFIER,
)TS_UNION_SPECIFIER is not in this tuple. Because _c_get_name delegates to _generic_get_name for any node that is neither in CPP_NAME_NODE_TYPES nor a function_definition, union nodes take the generic path instead of the explicit struct/enum path. While _generic_get_name likely resolves the name field correctly in practice, it is fragile and confusing: _c_get_name is using a constant explicitly named for C++ that intentionally omits unions.
A dedicated C_NAME_NODE_TYPES constant should be defined in constants.py that includes TS_UNION_SPECIFIER:
# in constants.py
C_NAME_NODE_TYPES = (
TS_STRUCT_SPECIFIER,
TS_UNION_SPECIFIER,
TS_ENUM_SPECIFIER,
)and _c_get_name should reference cs.C_NAME_NODE_TYPES instead of cs.CPP_NAME_NODE_TYPES.
Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/language_spec.py
Line: 107-120
Comment:
**`union_specifier` names silently fall through to `_generic_get_name`**
`CPP_NAME_NODE_TYPES` is defined in `constants.py` (line 2656–2660) as:
```python
CPP_NAME_NODE_TYPES = (
CppNodeType.CLASS_SPECIFIER,
TS_STRUCT_SPECIFIER,
TS_ENUM_SPECIFIER,
)
```
`TS_UNION_SPECIFIER` is **not** in this tuple. Because `_c_get_name` delegates to `_generic_get_name` for any node that is neither in `CPP_NAME_NODE_TYPES` nor a `function_definition`, union nodes take the generic path instead of the explicit struct/enum path. While `_generic_get_name` likely resolves the `name` field correctly in practice, it is fragile and confusing: `_c_get_name` is using a constant explicitly named for C++ that intentionally omits unions.
A dedicated `C_NAME_NODE_TYPES` constant should be defined in `constants.py` that includes `TS_UNION_SPECIFIER`:
```python
# in constants.py
C_NAME_NODE_TYPES = (
TS_STRUCT_SPECIFIER,
TS_UNION_SPECIFIER,
TS_ENUM_SPECIFIER,
)
```
and `_c_get_name` should reference `cs.C_NAME_NODE_TYPES` instead of `cs.CPP_NAME_NODE_TYPES`.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Added C_NAME_NODE_TYPES to constants.py with TS_STRUCT_SPECIFIER, TS_UNION_SPECIFIER, and TS_ENUM_SPECIFIER. _c_get_name now uses cs.C_NAME_NODE_TYPES so union nodes are handled explicitly. Commit 74e10f2.
pyproject.toml
Outdated
| "protobuf>=5.27.0", | ||
| "defusedxml>=0.7.1", | ||
| "huggingface-hub[hf-xet]>=0.36.0", | ||
| "tree-sitter-c>=0.24.1", |
There was a problem hiding this comment.
tree-sitter-c version bounds are inconsistent between core and extras
The package appears in two places with different minimum versions:
- Core dependencies (line 55):
tree-sitter-c>=0.24.1 treesitter-fullextra (line 85):tree-sitter-c>=0.21.0
Every other language parser (tree-sitter-cpp, tree-sitter-java, etc.) lives only in the treesitter-full extra. Adding tree-sitter-c as a core dependency forces it to be installed for all users regardless of which extras they choose. The looser >=0.21.0 bound in the extra is also dead code since the core pinning already constrains resolution to >=0.24.1.
Either remove the duplicate from treesitter-full and leave only the core entry, or — following the existing pattern — move tree-sitter-c entirely to treesitter-full with a single consistent version bound.
| "tree-sitter-c>=0.24.1", | |
| "tree-sitter-c>=0.24.1", |
(and remove the >=0.21.0 entry from treesitter-full)
Prompt To Fix With AI
This is a comment left during a code review.
Path: pyproject.toml
Line: 55
Comment:
**`tree-sitter-c` version bounds are inconsistent between core and extras**
The package appears in two places with different minimum versions:
- Core dependencies (line 55): `tree-sitter-c>=0.24.1`
- `treesitter-full` extra (line 85): `tree-sitter-c>=0.21.0`
Every other language parser (`tree-sitter-cpp`, `tree-sitter-java`, etc.) lives **only** in the `treesitter-full` extra. Adding `tree-sitter-c` as a core dependency forces it to be installed for all users regardless of which extras they choose. The looser `>=0.21.0` bound in the extra is also dead code since the core pinning already constrains resolution to `>=0.24.1`.
Either remove the duplicate from `treesitter-full` and leave only the core entry, or — following the existing pattern — move `tree-sitter-c` entirely to `treesitter-full` with a single consistent version bound.
```suggestion
"tree-sitter-c>=0.24.1",
```
(and remove the `>=0.21.0` entry from `treesitter-full`)
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Removed tree-sitter-c from core dependencies and updated the treesitter-full extra to a single >=0.24.1 entry, matching the pattern of every other tree-sitter grammar. Commit 74e10f2.
Closes #128
Summary
SupportedLanguage.Cand the full plumbing needed to index.cfiles with tree-sitter-c_c_get_name()correctly unwrapspointer_declaratorchains so pointer-return functions (e.g.br_pixelmap *BrPixelmapAllocate(...)) are named correctly, unlike a naive port of the C++ approachC_FQN_SPECscopes totranslation_unit,struct_specifier,union_specifier,enum_specifier(no namespaces in C)function_definitiononly (droppeddeclarationwhich is too broad and catches variable/typedef declarations),struct/union/enumfor classes,call_expressionfor calls,preproc_includefor importsLanguageMetadatastatus set toDEV— solid for standard C, known limitation with unexpanded macros (see below)test_language_node_coverage.pyandtest_handler_registry.pyupdated to cover CKnown limitation: calling-convention macros
Files that use macros between the return type and function name (Watcom C, Windows
WINAPI, etc.) produceERRORnodes in tree-sitter because the grammar cannot expand macros. Functions declared this way will not be indexed. A preprocessor pass (e.g.cpp -Porpcpp) before parsing would fix this and is worth a follow-up..hfile handlingHeaders are still routed to the C++ parser by default. A per-directory heuristic (C-only dirs use C parser, mixed dirs use C++ parser) was prototyped in the original issue #128 patch and can be added as a follow-up without changing the core design here.
Test plan
test_language_node_coverage.py-- C in extension mapping, language spec paramstest_handler_registry.py-- C returnsBaseLanguageHandler, has all protocol methodstest_fqn_resolver.py-- no regressionstest_cpp_basic_syntax.py/test_cpp_comprehensive.py-- no regressionsBRender-v1.3.2/core/fw/token.cand extracted 8 functions correctly; pointer-return functions likebr_pixelmap *BrPixelmapAllocate()resolve to the correct name