fix: build explicit PyArrow schema in Neo4jGraphParquetFormatter by matteomedioli · Pull Request #500 · neo4j/neo4j-graphrag-python

matteomedioli · 2026-03-30T17:07:55Z

Description

When no schema is passed, pa.Table.from_pylist(rows) infers the table schema (and therefore the Parquet schema) from the first row it encounters. When the first node or relationship row has no embedding_properties, the embedding column is never created and all embedding values from subsequent rows are silently discarded. This happens in practice because of the default set here:

neo4j-graphrag-python/src/neo4j_graphrag/experimental/components/lexical_graph.py

Line 138 in 3d092f6

embedding_properties = {}

When embeddings are null or empty, embedding_properties defaults to an empty dict ({}).

Minimal reproduction code:

from io import BytesIO

import pyarrow.parquet as pq
from neo4j_graphrag.experimental.components.parquet_formatter import Neo4jGraphParquetFormatter
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig, Neo4jGraph, Neo4jNode

formatter = Neo4jGraphParquetFormatter()
config = LexicalGraphConfig()

# Batch 1 (c0, c1): embedding FAILED, so embedding_properties is empty; these nodes come first in the list.
# Batch 2 (c2, c3): embedding SUCCEEDED, so embedding_properties is present.
nodes = [
    Neo4jNode(id='c0', label='Chunk', properties={'text': 'a'}, embedding_properties={}),
    Neo4jNode(id='c1', label='Chunk', properties={'text': 'b'}, embedding_properties={}),
    Neo4jNode(id='c2', label='Chunk', properties={'text': 'c'}, embedding_properties={'embedding': [0.1, 0.2, 0.3]}),
    Neo4jNode(id='c3', label='Chunk', properties={'text': 'd'}, embedding_properties={'embedding': [0.4, 0.5, 0.6]}),
]
graph = Neo4jGraph(nodes=nodes, relationships=[])
data, _, _ = formatter.format_graph(graph, config)
# Read back the first Parquet file written for the nodes.
table = pq.read_table(BytesIO(list(data['nodes'].values())[0]))
print(table.schema.names)  # ['__id__', 'labels', 'text']: 'embedding' silently dropped
print(table.to_pylist())   # c2 and c3 have NO embedding: successful embeddings lost

The fix builds an explicit pa.schema from the union of all keys across every row before calling from_pylist. Missing values are filled with null, and float-list columns (e.g. embeddings) are normalised to list<float32> regardless of which row provides the first non-null sample. (Left to its own inference, PyArrow would type a Python list of floats as list<double>.)

Type of Change

Bug fix

Complexity:

Low

How Has This Been Tested?

Unit tests
E2E tests
Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

Documentation has been updated
Unit tests have been updated
E2E tests have been updated
Examples have been updated
New files have copyright header
CLA (https://neo4j.com/developer/cla/) has been signed
CHANGELOG.md updated if appropriate

matteomedioli marked this pull request as ready for review March 31, 2026 13:45

matteomedioli requested a review from a team as a code owner March 31, 2026 13:45

matteomedioli force-pushed the matteo/parquet-formatter-embedding-schema-fox-GENKGB-1065 branch 2 times, most recently from 3582081 to a0c03dc Compare April 1, 2026 14:47

matteomedioli added 3 commits April 1, 2026 16:48

66f2045 fix: build explicit PyArrow schema in Neo4jGraphParquetFormatter to prevent silent embedding column drops
cd9814d add: changelog
05939ab fix: list of float revert float32

matteomedioli force-pushed the matteo/parquet-formatter-embedding-schema-fox-GENKGB-1065 branch from a0c03dc to 05939ab Compare April 1, 2026 14:48

matteomedioli wants to merge 3 commits into main from matteo/parquet-formatter-embedding-schema-fox-GENKGB-1065
