Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
This annotator applies the heuristic rules from C4 (https://jmlr.org/papers/volume21/20-074/20-074.pdf). It follows the reference implementations from TensorFlow and Datatrove but, instead of filtering out data, it annotates it; based on these annotations, the data can be filtered later.
Apply C4 Quality filters
- Retain only lines that end in a terminal punctuation mark (! . " ?)
- Discard any page with fewer than 5 sentences and only retain lines that contain at least 3 words
- [NOT IMPLEMENTED] Remove any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”
- Remove any line with the word Javascript.
- Remove any page where the phrase “lorem ipsum” appears
- Remove any pages that contain a curly bracket
Additional filters not mentioned in the paper but present in the reference code:
- Remove lines containing a word longer than 1000 characters
- Remove lines with cookies and terms of use keywords
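The rules above can be sketched as follows. This is a minimal illustration, not the transform's actual code: the constants (terminal punctuation set, policy substrings) and the crude punctuation-based sentence count are assumptions; the real implementation uses the nltk punkt tokenizer and its own thresholds.

```python
# Illustrative constants; the transform's actual values may differ.
TERMINAL_PUNCT = ('.', '!', '"', '?')
POLICY_SUBSTRINGS = ("terms of use", "privacy policy", "cookie policy", "uses cookies")

def keep_line(line, min_words_per_line=3, max_word_length=1000):
    """Return True if a line survives the line-level filters."""
    stripped = line.strip()
    # Retain only lines that end in a terminal punctuation mark.
    if not stripped.endswith(TERMINAL_PUNCT):
        return False
    words = stripped.split()
    # Retain only lines with at least min_words_per_line words.
    if 0 < min_words_per_line and len(words) < min_words_per_line:
        return False
    # Drop lines containing a word longer than max_word_length characters.
    if 0 < max_word_length and any(len(w) > max_word_length for w in words):
        return False
    lowered = stripped.lower()
    # Drop lines mentioning "javascript".
    if "javascript" in lowered:
        return False
    # Drop lines with cookie / terms-of-use boilerplate.
    if any(p in lowered for p in POLICY_SUBSTRINGS):
        return False
    return True

def doc_drop_reason(lines, min_num_sentences=5):
    """Return a drop reason for a page, or '' if it should be kept."""
    kept = [l for l in lines if keep_line(l)]
    text = " ".join(kept)
    lowered = text.lower()
    if "lorem ipsum" in lowered:
        return "lorem_ipsum"
    if "{" in text or "}" in text:
        return "curly_bracket"
    # Crude sentence count by terminal punctuation; the transform uses nltk.
    n_sentences = sum(text.count(p) for p in (".", "!", "?"))
    if 0 < min_num_sentences and n_sentences < min_num_sentences:
        return "too_few_sentences"
    return ""
```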
Apply paragraph filtering from C4
- Remove the documents that have too few or too short paragraphs.
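A minimal sketch of this paragraph check, assuming the documented defaults (3 paragraphs of at least 200 characters, delimited by "\n"); the actual transform may count and measure paragraphs differently.

```python
def keep_document_paragraphs(text, min_paragraphs=3, min_paragraph_len=200,
                             delimiter="\n"):
    """Return True if the document has enough sufficiently long paragraphs."""
    paragraphs = text.split(delimiter)
    # Count only paragraphs that meet the minimum length (-1 disables the check).
    valid = [p for p in paragraphs
             if min_paragraph_len < 0 or len(p) >= min_paragraph_len]
    if min_paragraphs > 0 and len(valid) < min_paragraphs:
        return False
    return True
```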
Apply badwords filter from C4
- Remove the documents containing more than a specific fraction of bad words
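The keep_fraction mechanism described in the configuration below can be sketched like this. Note this is a simplified assumption: it flags a document on any bad-word hit, whereas the transform flags documents based on the fraction of bad words they contain.

```python
import random

def badwords_drop(text, badwords, keep_fraction=0.0, seed=None):
    """Return True if a flagged document should be marked for deletion.

    A random keep_fraction of the flagged documents is retained.
    """
    words = set(text.lower().split())
    if not words & set(badwords):  # no bad words: always keep
        return False
    rng = random.Random(seed)
    # Uniform draw in [0, 1): keep the document when the draw falls
    # below keep_fraction, drop it otherwise.
    return rng.random() >= keep_fraction
```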
The set of dictionary keys holding AnnotatorC4Transform configuration values is as follows:
- c4a_contents_column_name - specifies the name of the column holding the document text.
- c4a_clean_contents_column_name - specifies the name of the column where the cleaned document is saved in the output table. This column is added to the output tables. The default is text.
- c4a_drop_reason_column_name - specifies the name of the column where the reason to drop a document (an empty string for documents that are kept) is saved in the output table. This column is added to the output tables. The default is drop_reason.
- c4a_doc_stats_column_name - specifies the name of the column where the document stats are saved in the output table. This column is added to the output tables. The default is doc_stats.
- c4a_tokenizer_language - specifies the language for which a specific punkt tokenizer from nltk will be loaded. Currently, only English (en) is supported.
- c4a_split_paragraph - if True, split the document text on "\n". Set to False to apply the filters to each sentence instead of to each line. The default is True.
- c4a_remove_citations - if True, remove Wikipedia-style citations from the text. The default is True.
- c4a_filter_no_terminal_punct - if True, remove lines without terminal punctuation marks. The default is True.
- c4a_min_num_sentences - specifies the minimum number of sentences (after line filtering) in a valid document. Set to -1 to disable. The default is 5.
- c4a_min_words_per_line - specifies the minimum number of words in a valid line. Set to -1 to disable. The default is 3.
- c4a_max_word_length - specifies the maximum length of a valid word; lines with longer words are dropped. Set to -1 to disable. The default is 1000.
- c4a_filter_lorem_ipsum - if True, mark for deletion the documents that contain "lorem ipsum". The default is True.
- c4a_filter_javascript - if True, drop lines mentioning "javascript". The default is True.
- c4a_filter_curly_bracket - if True, drop documents containing '{' or '}'. The default is True.
- c4a_filter_policy - if True, drop lines containing any of the phrases in POLICY_SUBSTRINGS. The default is True.
- c4a_min_paragraphs - specifies the minimum number of valid paragraphs in a valid document. Set to -1 to disable. The default is 3.
- c4a_min_paragraph_len - specifies the minimum length of a valid paragraph in a document. Set to -1 to disable. The default is 200.
- c4a_paragraph_delimiter - specifies the character used to delimit paragraphs. The default is \n.
- c4a_ldnoobw_url - specifies the URL from which the LDNOOBW list will be retrieved.
- c4a_filter_badwords - if True, mark for deletion documents containing bad words. The default is False.
- c4a_badwords_keep_fraction - specifies the fraction of pages containing bad words that should be kept. The default is 0.0.
- c4a_badwords_seed - specifies the seed used for the uniform distribution generator for use with keep_fraction. The default is None.
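Putting the keys together, a configuration dictionary might look like the following. This is a hypothetical example: the "contents" value for the input column is an assumption (no default is documented for it); the remaining values are the documented defaults.

```python
# Hypothetical AnnotatorC4Transform configuration.
c4a_params = {
    "c4a_contents_column_name": "contents",  # assumption: no documented default
    "c4a_clean_contents_column_name": "text",
    "c4a_drop_reason_column_name": "drop_reason",
    "c4a_doc_stats_column_name": "doc_stats",
    "c4a_tokenizer_language": "en",
    "c4a_split_paragraph": True,
    "c4a_remove_citations": True,
    "c4a_filter_no_terminal_punct": True,
    "c4a_min_num_sentences": 5,
    "c4a_min_words_per_line": 3,
    "c4a_max_word_length": 1000,
    "c4a_filter_lorem_ipsum": True,
    "c4a_filter_javascript": True,
    "c4a_filter_curly_bracket": True,
    "c4a_filter_policy": True,
    "c4a_min_paragraphs": 3,
    "c4a_min_paragraph_len": 200,
    "c4a_paragraph_delimiter": "\n",
    "c4a_filter_badwords": False,
    "c4a_badwords_keep_fraction": 0.0,
    "c4a_badwords_seed": None,
}
```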
Additionally, a set of data-access-specific arguments is provided that enables the specification of the location of domain list files, so that these files can be stored in the local file system or in S3 storage, for example. The arguments are as follows (and generally match the TransformLauncher's data access arguments, but with the `c4a_` prefix).
- c4a_local_config - specifies the input and output folders, although these are not used by the transform.
- c4a_s3_config - specifies the input and output paths in S3.
- c4a_s3_credentials - provides credentials to access the S3 storage.
See the Command Line options below for specifics on these.
When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the options provided by the launcher.
options:
-h, --help show this help message and exit
--c4a_contents_column_name C4A_CONTENTS_COLUMN_NAME
Name of the column holding the document text
--c4a_clean_contents_column_name C4A_CLEAN_CONTENTS_COLUMN_NAME
Name of the column where the cleaned document is saved in the output table
--c4a_drop_reason_column_name C4A_DROP_REASON_COLUMN_NAME
Name of the column where the keep document decision (true/false) is saved
--c4a_doc_stats_column_name C4A_DOC_STATS_COLUMN_NAME
Name of the column where the document stats are saved
--c4a_tokenizer_language C4A_TOKENIZER_LANGUAGE
Language for which a specific punkt tokenizer from nltk will be loaded
--c4a_split_paragraph C4A_SPLIT_PARAGRAPH
If True, split on '\n'. Set to False to apply the filters to each sentence instead of to each line
--c4a_remove_citations C4A_REMOVE_CITATIONS
If True, remove wikipedia style citations from the text
--c4a_filter_no_terminal_punct C4A_FILTER_NO_TERMINAL_PUNCT
If True, remove lines without terminal punctuation marks
--c4a_min_num_sentences C4A_MIN_NUM_SENTENCES
Minimum number of sentences (after line filtering) in a valid document. Set to -1 to disable
--c4a_min_words_per_line C4A_MIN_WORDS_PER_LINE
Minimum number of words in a valid line. Set to -1 to disable
--c4a_max_word_length C4A_MAX_WORD_LENGTH
Maximum length of a word; drop lines with longer words. Set to -1 to disable
--c4a_filter_lorem_ipsum C4A_FILTER_LOREM_IPSUM
If True, mark for deletion the documents that contain 'lorem ipsum'
--c4a_filter_javascript C4A_FILTER_JAVASCRIPT
If True, drop lines mentioning 'javascript'
--c4a_filter_curly_bracket C4A_FILTER_CURLY_BRACKET
If True, drop documents containing '{' or '}'
--c4a_filter_policy C4A_FILTER_POLICY
If True, drop lines containing any of the phrases in POLICY_SUBSTRINGS
--c4a_min_paragraphs C4A_MIN_PARAGRAPHS
Minimum number of valid paragraphs in a valid document. Set to -1 to disable
--c4a_min_paragraph_len C4A_MIN_PARAGRAPH_LEN
Minimum length of a valid paragraph. Set to -1 to disable
--c4a_paragraph_delimiter C4A_PARAGRAPH_DELIMITER
The character used to delimit paragraphs
--c4a_ldnoobw_url C4A_LDNOOBW_URL
The URL from which the LDNOOBW list will be retrieved
--c4a_filter_badwords C4A_FILTER_BADWORDS
If True, mark for deletion documents containing bad words
--c4a_badwords_keep_fraction C4A_BADWORDS_KEEP_FRACTION
Percentage of pages containing bad words that should be kept
--c4a_badwords_seed C4A_BADWORDS_SEED
The seed used for the uniform distribution generator for use with keep_fraction
--c4a_s3_cred C4A_S3_CRED
AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
secret_key: secret key help text
url: optional s3 url
region: optional s3 region
Example: { 'access_key': 'access', 'secret_key': 'secret',
'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
'region': 'us-east-1' }
--data_s3_cred DATA_S3_CRED
AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
secret_key: secret key help text
url: optional s3 url
region: optional s3 region
Example: { 'access_key': 'access', 'secret_key': 'secret',
'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
'region': 'us-east-1' }
--data_s3_config DATA_S3_CONFIG
AST string containing input/output paths.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': 's3-path/your-input-bucket',
'output_folder': 's3-path/your-output-bucket' }
--data_local_config DATA_LOCAL_CONFIG
AST string containing input/output folders using local fs.
input_folder: Path to input folder of files to be processed
output_folder: Path to output folder of processed files
Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
--data_max_files DATA_MAX_FILES
Max amount of files to process
--data_checkpointing DATA_CHECKPOINTING
checkpointing flag
--data_files_to_checkpoint DATA_FILES_TO_CHECKPOINT
list of file extensions to choose for checkpointing.
--data_data_sets DATA_DATA_SETS
List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
--data_files_to_use DATA_FILES_TO_USE
list of file extensions to choose for input.
--data_num_samples DATA_NUM_SAMPLES
number of random input files to process
--runtime_pipeline_id RUNTIME_PIPELINE_ID
pipeline id
--runtime_job_id RUNTIME_JOB_ID
job id