The Code Quality transform captures code-specific metrics of the input data.
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.
- Shivdeep Singh (Shivdeep.Singh@ibm.com)
- Yang Zhao (yangzhao@ibm.com)
This module captures code-specific metrics of the input data. The implementation is borrowed from the work done in the CodeParrot and StarCoder projects. In the current implementation, the module computes the following metrics and reports each metric in an individual column:
- line-specific metrics, including mean and max line length
- character-to-token ratio: uses the input tokenizer to tokenize the input data and measures the ratio between characters and tokens
- identifies a high occurrence of the keywords "test" or "config" and tags such samples as config or test samples
- tags the samples as autogenerated if the sample contains keywords like `auto-generated`, `autogenerated` or `automatically generated`
- programming-language-specific identification, where:
  - if the input sample is in the `python` programming language and has no reference to constructs like `def` or `class`, it is highlighted as `has_no_keywords`
  - if the input sample is
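For intuition, the line metrics and character-to-token ratio above can be sketched as follows. This is a minimal sketch; `line_metrics` and `char_token_ratio` are hypothetical names, not the transform's actual helpers.

```python
# Hypothetical sketch of the per-sample metrics described above; function
# names are illustrative and not part of the actual transform.
def line_metrics(text: str) -> dict:
    """Mean line length, max line length, and line count for one sample."""
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "total_num_lines": len(lines),
    }

def char_token_ratio(text: str, tokenizer) -> float:
    """Characters per token under the configured tokenizer."""
    tokens = tokenizer.tokenize(text)
    return len(text) / max(len(tokens), 1)  # guard against empty token lists

metrics = line_metrics("def add(a, b):\n    return a + b\n")
```

A very low character-to-token ratio often signals unusual or encoded content rather than ordinary source code.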
This module adds the following fields into the output file:
- line_mean
- line_max
- total_num_lines
- avg_longest_lines
- alphanum_frac
- char_token_ratio
- autogenerated
- config_or_test
- has_no_keywords
- has_few_assignments
- is_xml
- is_html
- top_word_percent
- top_two_words_percent
- alpha_percent
- max_encoded_data_length
- encoded_data_percent
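For illustration, the `autogenerated` and `config_or_test` fields above could be computed along these lines. This is a sketch under assumed keyword lists and thresholds, not the transform's exact logic.

```python
# Illustrative sketch (not the actual implementation) of the autogenerated
# and config-or-test tagging; keywords and threshold are assumptions.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")

def is_autogenerated(text: str, scan_width: int = 100) -> bool:
    # Scan only the head of the file, where generator banners usually appear.
    head = text.lower()[:scan_width]
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)

def is_config_or_test(text: str, threshold: float = 0.05) -> bool:
    # Tag a sample when "test" or "config" occurs unusually often.
    words = text.lower().split()
    if not words:
        return False
    hits = sum(word in ("test", "config") for word in words)
    return hits / len(words) > threshold
```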
It uses a tokenizer to collect metrics specific to the token ratio. It is designed to download the tokenizer from the Hugging Face Hub if the input tokenizer is not found in the local cache. By default, it uses the `codeparrot/codeparrot` tokenizer.
The set of dictionary keys holding the CodeQualityTransform configuration is as follows:
| Key name | Default | Description |
|---|---|---|
| contents_column_name | contents | Name of the column that holds the code/document content |
| language_column_name | language | Name of the column that holds programming language information |
| tokenizer | codeparrot/codeparrot | HuggingFace tokenizer to use |
| hf_token | env-based | HuggingFace auth token (optional if public) |
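When invoking the transform from Python, these keys are typically passed with the `cq_` prefix used by the CLI flags below. The dictionary here is a hypothetical example built from the table above, not a verbatim excerpt from the project.

```python
# Hypothetical parameter dictionary mirroring the configuration table;
# the cq_ prefix matches the CLI flags used by the runtime.
code_quality_params = {
    "cq_contents_column_name": "contents",
    "cq_language_column_name": "language",
    "cq_tokenizer": "codeparrot/codeparrot",
}
```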
```shell
python -m dpk_code_quality.runtime \
  --cq_contents_column_name contents \
  --cq_language_column_name language \
  --cq_tokenizer codeparrot/codeparrot \
  --cq_hf_token <your_hf_token> \
  --data_local_config "{'input_folder': 'test-data/input', 'output_folder': 'output'}"
```

Following the testing strategy of data-processing-lib
Currently, we have: