Description
The pure SQL query will need to be run somewhere and the results will need to be processed into parquet files.
Pseudonymisation should be a simple matter of applying the lookup in the hashes summary file, although you could just query the hasher again. We will have to double-check that no other pseudonymisation is required.
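As a sketch of the lookup approach (the JSON layout, column name, and function names here are assumptions, not the real hashes file format):

```python
import json


def load_hash_lookup(hashes_path):
    """Load the identifier -> hash mapping from the daily hashes summary file.
    Assumes the file is a flat JSON object; adjust to the real layout."""
    with open(hashes_path) as f:
        return json.load(f)


def pseudonymise_rows(rows, lookup, id_column="csn"):
    """Replace the raw identifier column with its hashed value.
    Rows whose identifier is missing from the lookup are dropped rather
    than passed through, so raw identifiers can never leak downstream."""
    out = []
    for row in rows:
        hashed = lookup.get(row[id_column])
        if hashed is not None:
            pseudo = dict(row)
            pseudo[id_column] = hashed
            out.append(pseudo)
    return out
```

Dropping unmatched rows (rather than keeping them unhashed) fails safe if the hashes file and the query results ever get out of sync.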
Suggested implementation
I suggest that the queries are run inside Python using psycopg2, as that's what we're already using for the Emap correlation query. It should be controlled by a Snakefile rule so that dependencies are managed correctly (e.g. the job doesn't start before the previous one in the chain has finished). It should use the daily hashes.json file as an input.
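The dependency chain could be expressed as a Snakemake rule along these lines (the rule name, script name, and output path are illustrative, not the real ones; only hashes.json comes from the issue):

```
rule export_infrequent_query:
    input:
        hashes="hashes.json"          # daily hashes summary, produced upstream
    output:
        "infrequent/{date}.parquet"   # hypothetical new top-level sub-directory
    shell:
        "python run_infrequent_export.py --hashes {input.hashes} --output {output}"
```

Declaring hashes.json as an `input` is what makes Snakemake wait for the upstream job: the rule only fires once that file exists and is up to date.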
Done this way, it will run in the waveform-exporter container. The snakemake workflow is called by cron.
The data should be converted to parquet files (see the code that already converts the main data to parquet), including metadata footer. Probably want a new top-level sub-directory rather than putting this into pseudonymised. Would have to add this directory to the allow-list that is enforced in exporter.ftps.do_upload. Then need another rule to upload the daily dump via FTPS. Unsure if it should be split into different files per CSN or one big file.
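A minimal sketch of the upload side, assuming the allow-list is a simple set of top-level directory names (the real check lives in exporter.ftps.do_upload; the directory names and function names here are illustrative):

```python
import ftplib


# Hypothetical allow-list mirroring the check enforced in exporter.ftps.do_upload;
# the new top-level sub-directory would need adding to the real one.
ALLOWED_TOP_DIRS = {"pseudonymised", "infrequent"}


def is_allowed_remote_path(remote_path):
    """True if the upload path's top-level directory is on the allow-list."""
    top = remote_path.lstrip("/").split("/", 1)[0]
    return top in ALLOWED_TOP_DIRS


def upload_daily_dump(host, user, password, local_path, remote_path):
    """Upload one parquet file over FTPS (explicit TLS), refusing paths
    outside the allow-listed directories."""
    if not is_allowed_remote_path(remote_path):
        raise ValueError(f"{remote_path} is not in an allow-listed directory")
    with ftplib.FTP_TLS(host) as ftps:
        ftps.login(user, password)
        ftps.prot_p()  # encrypt the data channel, not just the control channel
        with open(local_path, "rb") as f:
            ftps.storbinary(f"STOR {remote_path}", f)
```
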
I'm assuming that this data is infrequent enough that one row per data point is OK, so the values column should be scalar. Might need value_string, value_numeric columns etc. since parquet is strongly typed.
Definition of Done
- Query runs automatically with the correct dependencies
- Results are uploaded with appropriate de-ID