Description
When a file is uploaded to the DSH with the same name as a file already on the "arrivals" drive, it isn't simply overwritten; instead a copy is created with a number appended to the end of the file stem.
E.g.
2024-08-25.ec2ab2963229a805d6bd4dc80a935c210aa1dda71e7715e12c01b75d60af5ccb.52912.noCh.mL.parquet
2024-08-25.ec2ab2963229a805d6bd4dc80a935c210aa1dda71e7715e12c01b75d60af5ccb.52912.noCh.mL 2.parquet
(note the " 2" added to the end of the file stem)
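Based on the example filenames above, the duplicate suffix appears to be a space plus a number inserted just before the `.parquet` extension. A minimal sketch of a regex for spotting it (the exact pattern is an assumption inferred from the two filenames, not confirmed behaviour):

```python
import re

# Assumed pattern: a duplicate upload gets " <number>" appended to the
# file stem, just before the ".parquet" extension.
DUP_SUFFIX = re.compile(r"^(?P<stem>.+) (?P<copy>\d+)\.parquet$")

name = ("2024-08-25.ec2ab2963229a805d6bd4dc80a935c210aa1dda71e7715e12c01b75d60af5ccb"
        ".52912.noCh.mL 2.parquet")
m = DUP_SUFFIX.match(name)
assert m is not None and m.group("copy") == "2"

# The original upload has no " <number>" suffix, so it doesn't match.
assert DUP_SUFFIX.match(name.replace(" 2", "")) is None
```

Note this would misfire on any legitimate filename whose stem happens to end in a space plus digits, hence the suggestion below about a clearer naming convention.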
This is kind of annoying because our users will now have two files attempting to represent the same information, but won't know which is the true version.
A file could get uploaded twice if, for example, the waveform feed went down for the second half of the day, leaving the CSV files incomplete; they would then be completed (and re-uploaded) after the nightly FTPS upload had already happened.
It could also happen if an FTPS connection is interrupted. Snakemake would register this as a failure and later retry, but FTP implementations are pretty archaic and I wouldn't be surprised if the partial file remained on the server in the meantime. If we're grouping file uploads (see #67) then it's even more likely.
If we were copying files to the S: drive with a script, it could look out for any file whose stem ends in a space followed by a number (a pattern which hopefully never appears in our units column; but consider renaming our files to end in .waveform.parquet so we have a clearer pattern to look out for). The script would then treat the highest-numbered copy as the correct version.
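The deduplication rule above could be sketched roughly as follows. All names here (`latest_versions`, the regex) are hypothetical, and it assumes no legitimate stem ends in a space plus digits:

```python
import re

# Assumed duplicate pattern: base stem, optionally followed by " <number>",
# then ".parquet". A missing suffix is treated as copy number 1.
DUP_SUFFIX = re.compile(r"^(?P<base>.+?)(?: (?P<copy>\d+))?\.parquet$")

def latest_versions(filenames):
    """Map each base stem to the filename with the highest copy number."""
    best = {}  # base stem -> (copy_number, filename)
    for name in filenames:
        m = DUP_SUFFIX.match(name)
        if m is None:
            continue  # not a .parquet file; ignore
        base = m.group("base")
        copy = int(m.group("copy") or 1)  # no suffix == first upload
        if base not in best or copy > best[base][0]:
            best[base] = (copy, name)
    return {base: name for base, (copy, name) in best.items()}
```

Given `["x.noCh.mL.parquet", "x.noCh.mL 2.parquet"]` this keeps only `"x.noCh.mL 2.parquet"`, i.e. the most recent upload wins. Whether "highest number == latest upload" always holds is itself an assumption worth verifying against the DSH's renaming behaviour.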
We'd hoped that we could transfer data to the S: drive using a manual copy (see #12), but if we do that we will have to manually inspect the file listing to spot duplicates and correct them by hand.
Definition of Done
- We check for duplicate uploads in some way and have a way of preventing our users from seeing duplicate data, or at least mitigate it by warning them about this problem
Comments
Related to #67 in that its implementation could create more duplicate uploads: if a 10-file FTPS connection fails halfway through, snakemake is probably going to be unable to tell that the first 5 files did in fact succeed.
Does @stefpiatek or the DSH team have a neat solution to this?