Code Quality Transform

The Code Quality transform captures code-specific metrics of the input data.

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.

Contributors

Summary

This module captures code-specific metrics of input data. The implementation is borrowed from work done in the CodeParrot and StarCoder projects. The current implementation computes the following metrics and reports each metric in its own column:

  • line-specific metrics: mean and max line length
  • character-to-token ratio: uses the input tokenizer to tokenize the input data and measures the ratio between characters and tokens
  • detects a high occurrence of the keywords "test" or "config" and tags such samples as config or test samples
  • tags a sample as autogenerated if it contains keywords like auto-generated, autogenerated, or automatically generated
  • programming-language-specific identification, where:
    • if the input sample is Python and has no reference to constructs like def or class, it is flagged as has_no_keywords
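The metrics above can be sketched in plain Python. This is an illustrative sketch, not the transform's actual code; the function names and the occurrence threshold in `is_config_or_test` are hypothetical.

```python
# Sketch of the line metrics and keyword tagging described above.
# Names and thresholds are illustrative, not taken from the transform.

def line_metrics(text: str) -> dict:
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "total_num_lines": len(lines),
    }

AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")
CONFIG_TEST_KEYWORDS = ("test", "config")

def is_autogenerated(text: str) -> bool:
    # Tag samples that declare themselves machine-generated.
    lowered = text.lower()
    return any(k in lowered for k in AUTOGEN_KEYWORDS)

def is_config_or_test(text: str, threshold: int = 5) -> bool:
    # Tag a sample when "test" or "config" occurs frequently
    # (threshold chosen here for illustration only).
    lowered = text.lower()
    return sum(lowered.count(k) for k in CONFIG_TEST_KEYWORDS) >= threshold

sample = "# auto-generated file\nx = 1\nprint(x)\n"
print(line_metrics(sample))   # line_max is 21, total_num_lines is 3
print(is_autogenerated(sample))  # True
```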

This module adds the following fields into the output file:

  • line_mean
  • line_max
  • total_num_lines
  • avg_longest_lines
  • alphanum_frac
  • char_token_ratio
  • autogenerated
  • config_or_test
  • has_no_keywords
  • has_few_assignments
  • is_xml
  • is_html
  • top_word_percent
  • top_two_words_percent
  • alpha_percent
  • max_encoded_data_length
  • encoded_data_percent
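Two of the Python-specific fields above, has_no_keywords and has_few_assignments, can be approximated with simple heuristics. This is a hedged sketch under assumed keyword lists and thresholds, not the transform's implementation.

```python
# Illustrative heuristics for the "has_no_keywords" and
# "has_few_assignments" output fields; keyword list and minimum
# are assumptions for this sketch.

PYTHON_KEYWORDS = ("def ", "class ", "import ", "from ")

def has_no_keywords(text: str) -> bool:
    # Flag Python samples with no reference to common constructs.
    return not any(k in text for k in PYTHON_KEYWORDS)

def has_few_assignments(text: str, minimum: int = 1) -> bool:
    # Rough count of "=" assignments, discounting "==" comparisons.
    count = sum(line.count("=") - 2 * line.count("==")
                for line in text.splitlines())
    return count < minimum

print(has_no_keywords("x = 1"))              # True: no def/class/import
print(has_few_assignments("print('hi')"))    # True: no assignments at all
```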

It uses a tokenizer to compute the character-to-token ratio. If the specified tokenizer is not found in the local cache, it is downloaded from the Hugging Face Hub. By default, the codeparrot/codeparrot tokenizer is used.
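The ratio itself is straightforward once a tokenizer is chosen. In this sketch a whitespace split stands in for the real tokenizer so the example runs offline; with the transformers library installed, `AutoTokenizer.from_pretrained("codeparrot/codeparrot")` would supply the actual one.

```python
# char_token_ratio sketch: characters divided by tokens.
# A whitespace tokenizer stands in for the Hugging Face tokenizer
# (codeparrot/codeparrot by default) so this runs without downloads.

def char_token_ratio(text: str, tokenize=str.split) -> float:
    tokens = tokenize(text)
    return len(text) / max(len(tokens), 1)

print(char_token_ratio("x = 1"))  # 5 characters over 3 tokens
```

A low ratio suggests token-dense content (e.g. minified or encoded data), which is why the transform also reports fields like encoded_data_percent.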

Configuration and Command Line Options

The set of dictionary keys holding CodeQualityTransform configuration are as follows:

| Key name             | Default               | Description                                              |
|----------------------|-----------------------|----------------------------------------------------------|
| contents_column_name | contents              | Name of the column that holds the code/document content  |
| language_column_name | language              | Name of the column that holds programming language information |
| tokenizer            | codeparrot/codeparrot | HuggingFace tokenizer to use                             |
| hf_token             | env-based             | HuggingFace auth token (optional if public)              |

Example

python -m dpk_code_quality.runtime \
  --cq_contents_column_name contents \
  --cq_language_column_name language \
  --cq_tokenizer codeparrot/codeparrot \
  --cq_hf_token <your_hf_token> \
  --data_local_config "{'input_folder': 'test-data/input', 'output_folder': 'output'}"

Notebook example

notebook

Testing

Testing follows the testing strategy of data-processing-lib.

Currently, we have: