logically separate codec metadata from codec execution #3884

@d-v-b

I'm working on adding a feature to zarr-python that will allow us to declare the codec classes we want to use at array-access time, meaning something like this: zarr.open_array(..., config={"codec_class_map": {**default_codecs, "blosc": MyBloscCodec}}). This will make it MUCH easier to play around with different codecs.

Here's a sketch of how this probably should work:

open_array(..., config=config_that_declares_codec_classes)
    # 1. get array metadata
    # 2. create Array class
    #   2.a create chunk encoding machinery based on array metadata
    #     2.a.1 resolve codec metadata into actual codecs that do encoding / decoding.
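
To make the proposed flow concrete, here is a minimal, self-contained sketch. It does not use zarr-python's real classes: CodecMetadata, ArrayMetadata, Array, ReverseCodec and IdentityCodec are all hypothetical stand-ins invented for illustration. The only point is that the metadata holds plain codec descriptions, and the codec classes from the codec_class_map are instantiated when the Array is built (step 2.a.1).

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class CodecMetadata:
    """Hypothetical: describes a codec, with no encode / decode logic."""
    name: str
    configuration: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ArrayMetadata:
    """Hypothetical stand-in for an array metadata document model."""
    shape: tuple[int, ...]
    codecs: tuple[CodecMetadata, ...]


class ReverseCodec:
    """Toy codec for illustration only: reverses the chunk bytes."""
    def __init__(self, **configuration: Any) -> None:
        self.configuration = configuration

    def encode(self, data: bytes) -> bytes:
        return data[::-1]

    def decode(self, data: bytes) -> bytes:
        return data[::-1]


class IdentityCodec(ReverseCodec):
    """Toy alternative implementation registered under the same codec name."""
    def encode(self, data: bytes) -> bytes:
        return data

    def decode(self, data: bytes) -> bytes:
        return data


class Array:
    """Hypothetical runtime class: it, not the metadata, owns the codecs."""
    def __init__(self, metadata: ArrayMetadata, codec_class_map: dict[str, type]) -> None:
        self.metadata = metadata
        # Step 2.a.1: resolve codec metadata into objects that encode / decode.
        self.codecs = tuple(
            codec_class_map[c.name](**c.configuration) for c in metadata.codecs
        )

    def encode_chunk(self, data: bytes) -> bytes:
        for codec in self.codecs:
            data = codec.encode(data)
        return data

    def decode_chunk(self, data: bytes) -> bytes:
        # Decoding applies the codecs in reverse order.
        for codec in reversed(self.codecs):
            data = codec.decode(data)
        return data


# Same metadata, two different codec implementations chosen at access time.
metadata = ArrayMetadata(shape=(4,), codecs=(CodecMetadata("reverse"),))
default_array = Array(metadata, {"reverse": ReverseCodec})
custom_array = Array(metadata, {"reverse": IdentityCodec})
```

Because the metadata object never touches a codec class, swapping the implementation is just a matter of passing a different mapping when the Array is constructed.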

In this design, the codec classes that actually encode and decode chunks are only introduced when we create the chunk encoding / decoding machinery. But in the current codebase, we create functional codec classes that do encoding / decoding when we create our model of array metadata documents (ArrayV3Metadata and ArrayV2Metadata). So the flow today looks like this:

open_array(..., config=config_that_declares_codec_classes)
    # 1. get array metadata
    #   1.a resolve codec metadata into actual codecs that do encoding / decoding.
    # 2. create Array class
    #   2.a create chunk encoding machinery based on array metadata

So the Array class doesn't own the codec classes that do encoding / decoding -- those are owned by the array metadata instead. IMO this is not correct. The array metadata classes should be narrowly scoped to representing array metadata, and the array class should be scoped to all the runtime stuff necessary for materializing the chunks made accessible via an array metadata document and a storage backend.

I'd like to figure out how we can move these chunk encoding / decoding classes off the array metadata documents. Minimally, for each codec we might define lightweight CodecMetadata classes that only exist to facilitate array creation + basic invariant checking. We could even have a totally generic CodecMetadata class that doesn't know anything about the set of codec implementations -- this would allow us to model array metadata documents where the chunks can't be decoded (because we don't have the right codec implementations), but the attributes can be read / written. BTW this is yet another example of why the v3 spec should structurally define the separate codec types, instead of putting them all in a flat array.
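
A rough sketch of what the fully generic variant might look like, assuming codec entries shaped like {"name": ..., "configuration": {...}} and a top-level "attributes" key. Everything here (CodecMetadata, parse_document, resolve_codecs) is a hypothetical name for illustration, not existing zarr-python API. The document parses even when no implementation exists for a codec; the failure only shows up when something tries to resolve the codecs for chunk access.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class CodecMetadata:
    """Hypothetical generic codec description; knows nothing about implementations."""
    name: str
    configuration: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "CodecMetadata":
        return cls(name=data["name"], configuration=dict(data.get("configuration", {})))


def parse_document(doc: str) -> tuple[dict[str, Any], tuple[CodecMetadata, ...]]:
    """Parse attributes and codec metadata without looking up any codec class."""
    raw = json.loads(doc)
    codecs = tuple(CodecMetadata.from_dict(c) for c in raw["codecs"])
    return raw.get("attributes", {}), codecs


def resolve_codecs(codecs: tuple[CodecMetadata, ...], codec_class_map: dict[str, type]) -> tuple:
    """Turn codec metadata into codec objects; fails only at this point."""
    missing = [c.name for c in codecs if c.name not in codec_class_map]
    if missing:
        raise LookupError(f"no codec implementation registered for: {missing}")
    return tuple(codec_class_map[c.name](**c.configuration) for c in codecs)


# A document that names a codec we have no implementation for.
document = json.dumps(
    {
        "attributes": {"units": "m"},
        "codecs": [{"name": "mystery", "configuration": {"level": 3}}],
    }
)
attributes, codecs = parse_document(document)  # works: attributes are readable
```

Calling resolve_codecs(codecs, {}) on this document raises LookupError, while the attributes stay readable and writable because parsing never needed a codec class.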

Open to feedback if people think this direction is not a good idea. We might want to consider doing the same thing for the ZDType classes as well.
