Save Parquet File for Single-cells during Colony Simulation without MongoDB? #389

@katha815

Problem Description

I am using ecoli_engine_process.py for colony simulation under baseline conditions. While I can successfully save snapshots of the colony state, I continuously encounter errors when trying to emit Parquet files for single-cell data. I wonder if anyone has successfully done this without an online emitter?

The errors and attempted debugging steps are shown below. I noticed the implemented antibiotic simulation uses an online database as the emitter, so I included observations about that in the last section.

There may be a different underlying reason for these failures, so I would appreciate any insights or suggestions!

Environment

  • vEcoli commit: (current main branch)
  • Python 3.12
  • Config: spatial.json inheritance with "emitter": "parquet"

Goal

Resume a colony simulation from a saved JSON state file (baseline_2gen_seed_0_colony_t6000.json) and output to Parquet files.

My Configuration File

configs/colony_baseline_test2.json:

{
    "inherit_from": ["spatial.json"],
    "description": "Test2: read from trial1, run for another generation, save parquet per cell, save final state",
    "initial_colony_file": "baseline_2gen_seed_0_colony_t6000",

    "seed": 0,
    "sim_data_path": "out/all_media_conditions1/parca/kb/simData.cPickle",
    
    "emitter": "parquet",
    "emitter_arg": {
        "out_dir": "out/colony_runs/baseline_3rd_gen_seed_0"
    },
    "emit_config": false,

    "max_duration": 9000,
    
    "save": true,
    "save_times": [9000],
    "colony_save_prefix": "baseline_3rd_gen",
    
    "parallel": false,
    
    "engine_process_reports": [
        ["boundary"],
        ["bulk"],
        ["listeners"],
        ["environment", "exchange"]
    ]
}

Issue 1: NumpyRandomStateSerializer Serialization/Deserialization Mismatch

Command:

python ecoli/experiments/ecoli_engine_process.py --config configs/colony_baseline_test2.json

Error:

Traceback (most recent call last):
  File "ecoli/experiments/ecoli_engine_process.py", line 526, in <module>
    run_simulation(config)
  File "ecoli/experiments/ecoli_engine_process.py", line 389, in run_simulation
    initial_state = get_state_from_file(...)
  File "ecoli/library/json_state.py", line 168, in get_state_from_file
    return json.loads(f.read(), object_hook=custom_decoder)
  File "ecoli/library/serialize.py", line 83, in deserialize
    data = orjson.loads(data)
orjson.JSONDecodeError: unexpected character: line 1 column 1 (char 0)

Possible Root Cause:

  • serialize() at lines 70-72 appears to output Python tuple format: ('MT19937', [...])
  • deserialize() calls orjson.loads() (line 83 in the traceback above), which expects JSON array format: ["MT19937", [...]]
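
The suspected mismatch can be reproduced in isolation. The sketch below is only an illustration: it uses the standard-library json module as a stand-in for orjson (an assumption, to keep it dependency-free) and a shortened, made-up state tuple rather than a real RandomState state. It shows that the string form of a Python tuple is rejected by a JSON parser but is accepted by ast.literal_eval.

```python
import ast
import json

# Shortened, made-up stand-in for np.random.RandomState().get_state();
# the real key array has 624 entries.
state = ("MT19937", [1, 2, 3], 624, 0, 0.0)

as_python_repr = str(state)   # Python tuple syntax: parentheses, single quotes
as_json = json.dumps(state)   # JSON array syntax: brackets, double quotes


def parses_as_json(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


repr_is_json = parses_as_json(as_python_repr)  # fails at the first character, "("
json_is_json = parses_as_json(as_json)

# ast.literal_eval safely parses Python literals, including tuples.
recovered = ast.literal_eval(as_python_repr)
```

If the serializer really does write the tuple form, this would be consistent with the decoder failing at line 1, column 1, as in the traceback above.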

File: ecoli/library/serialize.py

Original code (lines 78-85):

def deserialize(self, data):
    matched_regex = self.regex_for_serialized.fullmatch(data)
    if matched_regex:
        data = matched_regex.group(1)
    data = orjson.loads(data)
    rng = np.random.RandomState()
    rng.set_state(data)
    return rng

Attempted Fix: Fall back to ast.literal_eval() when the data is in Python tuple format, and keep orjson.loads() otherwise:

def deserialize(self, data):
    import ast
    matched_regex = self.regex_for_serialized.fullmatch(data)
    if matched_regex:
        data = matched_regex.group(1)
    if data.startswith("("):
        data = ast.literal_eval(data)
    else:
        data = orjson.loads(data)
    rng = np.random.RandomState()
    rng.set_state(tuple(data))
    return rng

Issue 2: Parquet Emitter Cannot Handle pint.Quantity in Config Metadata

Error (after attempting to fix Issue 1):

Traceback (most recent call last):
  File "ecoli/library/parquet_emitter.py", line 963, in emit
    v = np.asarray(v, dtype=np_dtype(v, k))
  File "ecoli/library/parquet_emitter.py", line 706, in np_dtype
    raise ValueError(f"{field_name} has unsupported type {type(val)}.")
ValueError: spatial_environment_config__multibody__bounds has unsupported type <class 'pint.Quantity'>.

During handling of the above exception, another exception occurred:
  File "ecoli/library/parquet_emitter.py", line 967, in emit
    v = pl.Series([v])
TypeError: not yet implemented: Nested object types

Possible Root Cause:

  • spatial.json contains pint.Quantity values (e.g., "!units[50 micrometer]")
  • np_dtype() doesn't appear to handle pint.Quantity, and falls back to Polars
  • pl.Series([v]) also seems unable to handle pint.Quantity objects
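
One way to work around this is to convert unsupported leaf values to strings before they reach the emitter. The sketch below is only an illustration under stated assumptions: FakeQuantity is a hypothetical stand-in for pint.Quantity (so the example has no dependencies), and to_emittable and SUPPORTED_LEAF_TYPES are made-up names, not part of parquet_emitter.py.

```python
class FakeQuantity:
    """Hypothetical stand-in for pint.Quantity (magnitude plus units)."""

    def __init__(self, magnitude, units):
        self.magnitude = magnitude
        self.units = units

    def __str__(self):
        return f"{self.magnitude} {self.units}"


# Leaf types assumed to be safe to emit as-is (an assumption for this sketch).
SUPPORTED_LEAF_TYPES = (str, bool, int, float, type(None))


def to_emittable(value):
    """Recursively replace unsupported leaf values with their string form."""
    if isinstance(value, SUPPORTED_LEAF_TYPES):
        return value
    if isinstance(value, (list, tuple)):
        return [to_emittable(v) for v in value]
    if isinstance(value, dict):
        return {k: to_emittable(v) for k, v in value.items()}
    # Anything else (for example a quantity with units) becomes a string.
    return str(value)


metadata = {
    "multibody": {
        "bounds": [FakeQuantity(50, "micrometer"), FakeQuantity(50, "micrometer")],
        "n_agents": 1,
    }
}
clean = to_emittable(metadata)
```

Converting to strings loses the numeric type, so this only makes sense for metadata that is stored for reference rather than queried numerically.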

File: ecoli/library/parquet_emitter.py

Original code (line 967):

v = pl.Series([v])

Attempted Fix: Convert unsupported types to string:

v = pl.Series([str(v)])

Issue 3: emit_config Setting Not Passed to Engine

Note: Even with the str(v) fix in place, I tried to disable config emission by setting "emit_config": false in the JSON config, but the setting appeared to have no effect.

Possible Root Cause:
ecoli_engine_process.py does not seem to pass the emit_config parameter to the Engine constructor.

File: ecoli/experiments/ecoli_engine_process.py

Original code (around line 465):

engine = Engine(
    processes=composite.processes,
    topology=composite.topology,
    initial_state=initial_state,
    experiment_id=experiment_id,
    emitter=emitter_config,
    progress_bar=config["progress_bar"],
    metadata=metadata,
    profile=config["profile"],
    initial_global_time=config.get("start_time", 0.0),
)

Attempted Fix: Add emit_config parameter:

engine = Engine(
    ...
    initial_global_time=config.get("start_time", 0.0),
    emit_config=config.get("emit_config", False),
)
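
The effect of not forwarding the setting can be illustrated with a stub. StubEngine below is a hypothetical stand-in, not the real Engine, and its default of emit_config=True is an assumption chosen only to make the difference visible; the real default is not verified here.

```python
class StubEngine:
    """Hypothetical stand-in for Engine that only records emit_config."""

    def __init__(self, emit_config=True, **kwargs):
        # Assumed default of True; other keyword arguments are ignored.
        self.emit_config = emit_config


config = {"emit_config": False, "progress_bar": True}

# Without forwarding, the value in the JSON config never reaches the engine.
not_forwarded = StubEngine(progress_bar=config["progress_bar"])

# With forwarding, the engine sees the configured value.
forwarded = StubEngine(
    progress_bar=config["progress_bar"],
    emit_config=config.get("emit_config", False),
)
```

In this sketch, not_forwarded.emit_config keeps the constructor default, while forwarded.emit_config reflects the config file.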

Issue 4: Parquet Emitter Assumes agents Key in Data Structure

Error (after attempting to fix Issues 1-3):

Traceback (most recent call last):
  File "ecoli/experiments/ecoli_engine_process.py", line 485, in run_simulation
    colony_save_states(engine, config)
  File "ecoli/experiments/ecoli_engine_process.py", line 255, in colony_save_states
    engine.update(time_to_next_save)
  ...
  File "ecoli/processes/engine_process.py", line 505, in next_update
    self.emitter.emit(emit_config)
  File "ecoli/library/parquet_emitter.py", line 1007, in emit
    if len(data["data"]["agents"]) > 1:
KeyError: 'agents'

Possible Root Cause:

  • ParquetEmitter.emit() appears to expect a data["data"]["agents"] structure (outer simulation)
  • EngineProcess inner emitter seems to send cell data directly without the agents wrapper
  • The inner emitter is configured via inner_emitter in the EngineProcess config

File: ecoli/library/parquet_emitter.py (line 1007)
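
A possible direction is to guard the lookup so that payloads without an agents wrapper are treated as single-cell data. The sketch below is an assumption-laden illustration: count_agents is a made-up helper, and both payload shapes are guesses based on the traceback and the observations above, not the emitter's documented format.

```python
def count_agents(data):
    """Return the number of agents in an emitted payload.

    A payload without an "agents" key is treated as data for one cell.
    """
    agents = data.get("data", {}).get("agents")
    if agents is None:
        return 1
    return len(agents)


# Assumed shape emitted by an outer (colony-level) simulation.
outer_payload = {"data": {"agents": {"0": {}, "00": {}, "01": {}}}}

# Assumed shape emitted by an EngineProcess inner emitter (no wrapper).
inner_payload = {"data": {"bulk": [], "listeners": {}}}

n_outer = count_agents(outer_payload)
n_inner = count_agents(inner_payload)
```

Whether the inner emitter should instead wrap its data in an agents key before emitting is a separate design question that this sketch does not answer.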


Observations on Implemented Antibiotic Simulation

The tet_amp_sim.py uses a different configuration that appears to avoid these issues:

# From configs/cloud.json (inherited by antibiotics.json)
{
    "emitter": "database",  # MongoDB, not Parquet
    "emitter_arg": {"host": "10.138.0.75:27017", "emit_limit": 4100000}
}

With the database emitter, from what I can tell:

  • pint.Quantity values in the config do not seem to cause emit errors
  • There seem to be no JSON serialization issues with NumpyRandomState
  • Nothing appears to assume an agents key in the emitted data

Summary Table

| Issue | File | Line | Status |
|---|---|---|---|
| 1. RandomState serialize/deserialize mismatch | serialize.py | 78-85 | Attempted fix with ast.literal_eval() |
| 2. pint.Quantity not supported | parquet_emitter.py | 967 | Attempted fix with str(v) |
| 3. emit_config not passed to Engine | ecoli_engine_process.py | ~470 | Attempted fix by adding parameter |
| 4. Missing agents key handling | parquet_emitter.py | 1007 | UNRESOLVED |
