QED preprocessing - cannot reproduce the same split #1

@ziling-cheng

Hi organizers,

Thanks for providing the preprocessing code!

I was trying to reproduce the QED preprocessing and splitting. Everything up to and including the commands below is reproducible:

cd ${PROJECT_DIR}/tmp
gdown 1R2xWtNeVX48RiFA7vErL1pNtws3XEsYP
unzip qed.zip
cd ${PROJECT_DIR}
python preprocess_qed.py tmp/en tmp/qed

But when I then tried to run

cat tmp/qed/* >> preprocessed_data/qed.txt

directly, I ran into the error:

bash: /bin/cat: Argument list too long

I tried several ways of concatenating the files (e.g. writing to a single final txt file directly in preprocess_qed.py, or using other Linux commands that avoid the long argument list while mimicking the behaviour of cat *). But none of these methods produces the same split as the released processed QED dataset after running

python sample_chunks_and_split.py --input_file preprocessed_data/qed.txt --output_dir babylm_data --n_keep 10000000 --n_keep_dev 1000000 --split_at 5000 --seed 3

with the concatenated file.

I was wondering whether the files need to be concatenated in a specific order, or whether there is simply a way to avoid the error while still reproducing the same split as the released data.
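
To make the ordering question concrete, here is a minimal sketch of a concatenation that avoids the argument-list limit while fixing the file order explicitly. It is only an illustration: the function name `concat_sorted` is hypothetical, and it assumes that the intended order is the one a shell glob would give. bash sorts the expansion of `*` according to the current locale's collation, so the sketch takes an optional sort key. Whether the released data was built with either ordering is not known.

```python
import os
import shutil


def concat_sorted(src_dir, out_path, key=None):
    """Append every regular, non-hidden file in src_dir to out_path.

    Files are taken in sorted name order. With key=None, Python sorts
    strings by code point, which for ASCII names corresponds to a glob
    expanded under LC_ALL=C. (Assumption: glob order is the order wanted.)
    """
    names = sorted(
        (
            name
            for name in os.listdir(src_dir)
            # The shell's * skips dotfiles, so they are skipped here too.
            if not name.startswith(".")
            and os.path.isfile(os.path.join(src_dir, name))
        ),
        key=key,
    )
    # "ab" appends, like >> in the shell command.
    with open(out_path, "ab") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return names
```

Calling `concat_sorted("tmp/qed", "preprocessed_data/qed.txt")` would append the files in code-point order. For a locale-aware order, one could call `locale.setlocale(locale.LC_COLLATE, "")` first and pass `key=locale.strxfrm`. The two orders can differ (for example in how uppercase letters, digits and punctuation are ranked), which may be enough to change the resulting split.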

Thank you so much!

Best,
Ziling
