QED preprocessing - cannot reproduce the same split

Hi organizer,

Thanks for providing the preprocessing code!

I was trying to reproduce the QED processing and splitting. Everything up to the chunk of code below is reproducible:
```
cd ${PROJECT_DIR}/tmp
gdown 1R2xWtNeVX48RiFA7vErL1pNtws3XEsYP
unzip qed.zip
cd ${PROJECT_DIR}
python preprocess_qed.py tmp/en tmp/qed
```
But when I tried to run
```
cat tmp/qed/* >> preprocessed_data/qed.txt
```
directly, I ran into the error:
```
bash: /bin/cat: Argument list too long
```
I tried several ways of concatenating the files (e.g. writing to a single final txt file directly in `preprocess_qed.py`, or using other Linux commands that avoid the long-argument-list issue while mimicking the behaviour of `cat *`). However, none of them produces the same split as the released processed QED dataset after running
``` 
python sample_chunks_and_split.py --input_file preprocessed_data/qed.txt --output_dir babylm_data --n_keep 10000000 --n_keep_dev 1000000 --split_at 5000 --seed 3
```
with the concatenated file.
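For concreteness, here is a minimal sketch of one such concatenation approach. It assumes the files should be joined in byte-wise sorted filename order, which is what `cat *` would produce under `LC_ALL=C`; the glob order under other locales can differ, and the order used for the released data is not known. The function name `concat_sorted` is hypothetical and not part of the provided scripts.

```python
import os
import shutil


def concat_sorted(src_dir, out_path):
    """Append every regular file in src_dir to out_path and return the names used."""
    # os.listdir returns names in arbitrary order, so sort explicitly.
    # Sorting the encoded names mimics glob order under LC_ALL=C (an assumption).
    # Dotfiles are skipped because a bare `*` glob does not match them.
    names = sorted(
        (n for n in os.listdir(src_dir) if not n.startswith(".")),
        key=os.fsencode,
    )
    used = []
    with open(out_path, "ab") as out:  # "ab" mirrors the `>>` append redirect
        for name in names:
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories and other non-regular entries
            with open(path, "rb") as f:
                shutil.copyfileobj(f, out)
            used.append(name)
    return used
```

Calling `concat_sorted("tmp/qed", "preprocessed_data/qed.txt")` stands in for the `cat` command above without passing the file list on a command line, so the argument-length limit does not apply.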

Is there a specific order in which the files should be concatenated, or, more simply, a way to avoid the error while still reproducing the same split as the released data?

Thank you so much!

Best,
Ziling



