Hi organizer,
Thanks for providing the preprocessing code!
I was trying to reproduce the QED processing and splitting. Everything up to and including the chunk of code below is reproducible:
cd ${PROJECT_DIR}/tmp
gdown 1R2xWtNeVX48RiFA7vErL1pNtws3XEsYP
unzip qed.zip
cd ${PROJECT_DIR}
python preprocess_qed.py tmp/en tmp/qed
But when I tried to run
cat tmp/qed/* >> preprocessed_data/qed.txt
directly, I ran into the error:
bash: /bin/cat: Argument list too long
I tried several ways of concatenating the files (e.g. writing to a single final txt file directly in preprocess_qed.py, or using other Linux commands that avoid the argument-list limit while mimicking the behaviour of cat *). But none of these methods gives the same split as the released processed QED dataset after running
python sample_chunks_and_split.py --input_file preprocessed_data/qed.txt --output_dir babylm_data --n_keep 10000000 --n_keep_dev 1000000 --split_at 5000 --seed 3
with the concatenated file.
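For concreteness, here is a minimal sketch of one such workaround. It assumes the files should be appended in byte-wise sorted filename order, which is what the shell's * expansion produces under LC_ALL=C; in other locales the glob may sort differently, so this may not match the order used for the released data. The function name concat_sorted and the paths in the usage note are illustrative only and are not part of the released preprocessing code.

```python
import os
import shutil


def concat_sorted(src_dir, out_path):
    """Append every regular file in src_dir to out_path.

    Files are taken in byte-wise sorted name order (an assumption:
    this mimics `cat src_dir/*` under LC_ALL=C, not necessarily the
    order used to build the released split).
    """
    # os.listdir returns names in arbitrary order, so sort explicitly.
    # Names starting with "." are skipped, as the shell's * would do.
    names = sorted(
        (n for n in os.listdir(src_dir) if not n.startswith(".")),
        key=os.fsencode,
    )
    # "ab" mirrors the >> (append) redirection in the original command.
    with open(out_path, "ab") as out:
        for name in names:
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as f:
                    shutil.copyfileobj(f, out)
    return names
```

It would be called as concat_sorted("tmp/qed", "preprocessed_data/qed.txt"). Because the file names are never passed on a command line, the argument-list limit does not apply.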
I was wondering whether the files need to be concatenated in a specific order, or, more simply, whether there is a way to avoid the error while reproducing the same split as the released data.
Thank you so much!
Best,
Ziling