Integrate labs/flowsheet query into the workflow #66

@jeremyestein

Description

The pure SQL query will need to be run somewhere and the results will need to be processed into parquet files.

Pseudonymisation should be a simple matter of applying the lookup in the hashes summary file, though you could also just query the hasher again. We will have to double-check that no other pseudonymisation is required.
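Applying the lookup could be as simple as this sketch. The layout of the hashes summary file (a flat JSON object mapping raw identifier to hash) and the function name are assumptions:

```python
import json

def pseudonymise_csns(csns, hashes_path):
    """Map raw CSNs to their pseudonyms via the daily hashes summary file.

    Assumes the file is a flat JSON object of raw identifier -> hash;
    the real layout may differ. A KeyError here means an identifier was
    never hashed, which would be the point to fall back to the hasher.
    """
    with open(hashes_path) as f:
        lookup = json.load(f)
    return [lookup[csn] for csn in csns]
```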

Suggested implementation

I suggest that the queries are run inside Python using psycopg2, as that's what we're already using for the Emap correlation query. It should be controlled by a Snakefile rule so that dependencies are managed correctly (e.g. the job doesn't start before the previous one in the chain has finished). It should use the daily hashes.json file as an input.
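A minimal shape for the query step, assuming psycopg2 as suggested (function names, the connection string and the query text are all placeholders):

```python
def fetch_flowsheet_rows(conninfo, query, params=None):
    """Run the labs/flowsheet SQL and return (column names, rows).

    `conninfo` is a libpq connection string. The psycopg2 import is
    deferred so the pure helper below stays usable without a database.
    """
    import psycopg2
    with psycopg2.connect(conninfo) as conn:
        with conn.cursor() as cur:
            cur.execute(query, params)
            cols = [d[0] for d in cur.description]
            return cols, cur.fetchall()

def rows_to_records(cols, rows):
    """Pair column names with each row, ready for parquet conversion."""
    return [dict(zip(cols, row)) for row in rows]
```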

Done this way, it will run in the waveform-exporter container. The snakemake workflow is called by cron.
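As a Snakefile rule, the dependency on the daily hashes file might look like this sketch (rule, script and path names are all hypothetical):

```snakemake
rule export_flowsheet:
    input:
        hashes="hashes.json"          # daily hashes summary file
    output:
        "flowsheet/{date}.parquet"    # hypothetical new top-level sub-directory
    script:
        "scripts/export_flowsheet.py" # runs the SQL via psycopg2, writes parquet
```

Snakemake will then only start this job once hashes.json exists, which gives us the dependency ordering for free.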

The data should be converted to parquet files (see the code that already converts the main data to parquet), including a metadata footer. We probably want a new top-level sub-directory rather than putting this into pseudonymised, and would have to add that directory to the allow-list enforced in exporter.ftps.do_upload. Then we need another rule to upload the daily dump via FTPS. Unsure whether it should be split into one file per CSN or one big file.

I'm assuming that this data is infrequent enough that one row per data point is OK, so the values column should be scalar. We might need value_string, value_numeric columns etc., since parquet is strongly typed.

Definition of Done

  • Query runs automatically with the correct dependencies
  • Results are uploaded with appropriate de-ID

Dependencies

#9
