fix: build explicit PyArrow schema in Neo4jGraphParquetFormatter #500

Open
matteomedioli wants to merge 3 commits into main from matteo/parquet-formatter-embedding-schema-fox-GENKGB-1065
@matteomedioli matteomedioli commented Mar 30, 2026

Description

When no schema is passed, pa.Table.from_pylist(rows) takes its column names from the first row only. When the first node or relationship row has no embedding_properties, the embedding column is never created and all embedding values from subsequent rows are silently discarded. According to the linked source, when embeddings are null or empty, embedding_properties defaults to an empty dict {}.

Minimal reproduction code:

from io import BytesIO

import pyarrow.parquet as pq
from neo4j_graphrag.experimental.components.parquet_formatter import Neo4jGraphParquetFormatter
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig, Neo4jGraph, Neo4jNode

formatter = Neo4jGraphParquetFormatter()
config = LexicalGraphConfig()

# batch 1 FAILED → no embedding_properties → first in node list
# batch 2 SUCCEEDED → embedding_properties present
nodes = [
    Neo4jNode(id='c0', label='Chunk', properties={'text': 'a'}, embedding_properties={}),
    Neo4jNode(id='c1', label='Chunk', properties={'text': 'b'}, embedding_properties={}),
    Neo4jNode(id='c2', label='Chunk', properties={'text': 'c'}, embedding_properties={'embedding': [0.1, 0.2, 0.3]}),
    Neo4jNode(id='c3', label='Chunk', properties={'text': 'd'}, embedding_properties={'embedding': [0.4, 0.5, 0.6]}),
]
graph = Neo4jGraph(nodes=nodes, relationships=[])
data, _, _ = formatter.format_graph(graph, config)
table = pq.read_table(BytesIO(list(data['nodes'].values())[0]))
print(table.schema.names)  # ['__id__', 'labels', 'text']: 'embedding' silently dropped
print(table.to_pylist())   # c2 and c3 have NO embedding: successful embeddings lost

The fix builds an explicit pa.schema from the union of all keys across every row before calling from_pylist. Missing values are filled with null, and float-list columns (e.g. embeddings) are normalised to list<float32> regardless of which row provides the first non-null sample. Note that PyArrow's own inference types a list of Python floats as list<double>, so the float32 element type has to be set explicitly rather than inferred.

Type of Change

  • Bug fix

Complexity:

  • Low

How Has This Been Tested?

  • Unit tests
  • E2E tests
  • Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

  • Documentation has been updated
  • Unit tests have been updated
  • E2E tests have been updated
  • Examples have been updated
  • New files have copyright header
  • CLA (https://neo4j.com/developer/cla/) has been signed
  • CHANGELOG.md updated if appropriate

@matteomedioli matteomedioli marked this pull request as ready for review March 31, 2026 13:45
@matteomedioli matteomedioli requested a review from a team as a code owner March 31, 2026 13:45
@matteomedioli matteomedioli force-pushed the matteo/parquet-formatter-embedding-schema-fox-GENKGB-1065 branch 2 times, most recently from 3582081 to a0c03dc Compare April 1, 2026 14:47
@matteomedioli matteomedioli force-pushed the matteo/parquet-formatter-embedding-schema-fox-GENKGB-1065 branch from a0c03dc to 05939ab Compare April 1, 2026 14:48