Code Quality Transform

The Code Quality transform captures code-specific metrics of the input data.

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.

Contributors

Summary

This module captures code-specific metrics of input data. The implementation is borrowed from work done in the CodeParrot and StarCoder projects. The current implementation computes the following metrics and reports each metric in its own column:

  • line-specific metrics: mean and max line length
  • character-to-token ratio: uses the input tokenizer to tokenize the input data and measures the ratio between characters and tokens
  • detects a high occurrence of the keywords "test" or "config" and tags such samples as config or test samples
  • tags a sample as autogenerated if it contains keywords like auto-generated, autogenerated, or automatically generated
  • programming-language-specific identification, where:
    • if the input sample is Python and has no reference to constructs like def or class, it is flagged as has_no_keywords
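The metrics above can be sketched in plain Python. This is an illustrative sketch, not the transform's actual code; the function names and the occurrence threshold in `is_config_or_test` are hypothetical.

```python
# Sketch of the line metrics and keyword tagging described above.
# Names and thresholds are illustrative, not taken from the transform.

def line_metrics(text: str) -> dict:
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "total_num_lines": len(lines),
    }

AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")
CONFIG_TEST_KEYWORDS = ("test", "config")

def is_autogenerated(text: str) -> bool:
    # Tag samples that declare themselves machine-generated.
    lowered = text.lower()
    return any(k in lowered for k in AUTOGEN_KEYWORDS)

def is_config_or_test(text: str, threshold: int = 5) -> bool:
    # Tag a sample when "test" or "config" occurs frequently
    # (threshold chosen here for illustration only).
    lowered = text.lower()
    return sum(lowered.count(k) for k in CONFIG_TEST_KEYWORDS) >= threshold

sample = "# auto-generated file\nx = 1\nprint(x)\n"
print(line_metrics(sample))   # line_max is 21, total_num_lines is 3
print(is_autogenerated(sample))  # True
```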

This module adds the following fields into the output file:

  • line_mean
  • line_max
  • total_num_lines
  • avg_longest_lines
  • alphanum_frac
  • char_token_ratio
  • autogenerated
  • config_or_test
  • has_no_keywords
  • has_few_assignments
  • is_xml
  • is_html
  • top_word_percent
  • top_two_words_percent
  • alpha_percent
  • max_encoded_data_length
  • encoded_data_percent
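Two of the Python-specific fields above, has_no_keywords and has_few_assignments, can be approximated with simple heuristics. This is a hedged sketch under assumed keyword lists and thresholds, not the transform's implementation.

```python
# Illustrative heuristics for the "has_no_keywords" and
# "has_few_assignments" output fields; keyword list and minimum
# are assumptions for this sketch.

PYTHON_KEYWORDS = ("def ", "class ", "import ", "from ")

def has_no_keywords(text: str) -> bool:
    # Flag Python samples with no reference to common constructs.
    return not any(k in text for k in PYTHON_KEYWORDS)

def has_few_assignments(text: str, minimum: int = 1) -> bool:
    # Rough count of "=" assignments, discounting "==" comparisons.
    count = sum(line.count("=") - 2 * line.count("==")
                for line in text.splitlines())
    return count < minimum

print(has_no_keywords("x = 1"))              # True: no def/class/import
print(has_few_assignments("print('hi')"))    # True: no assignments at all
```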

It uses a tokenizer to compute the character-to-token ratio. If the specified tokenizer is not found in the local cache, it is downloaded from the Hugging Face Hub. By default, the codeparrot/codeparrot tokenizer is used.
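The ratio itself is straightforward once a tokenizer is chosen. In this sketch a whitespace split stands in for the real tokenizer so the example runs offline; with the transformers library installed, `AutoTokenizer.from_pretrained("codeparrot/codeparrot")` would supply the actual one.

```python
# char_token_ratio sketch: characters divided by tokens.
# A whitespace tokenizer stands in for the Hugging Face tokenizer
# (codeparrot/codeparrot by default) so this runs without downloads.

def char_token_ratio(text: str, tokenize=str.split) -> float:
    tokens = tokenize(text)
    return len(text) / max(len(tokens), 1)

print(char_token_ratio("x = 1"))  # 5 characters over 3 tokens
```

A low ratio suggests token-dense content (e.g. minified or encoded data), which is why the transform also reports fields like encoded_data_percent.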

Configuration and Command Line Options

The set of dictionary keys holding CodeQualityTransform configuration are as follows:

| Key name             | Default               | Description                                              |
|----------------------|-----------------------|----------------------------------------------------------|
| contents_column_name | contents              | Name of the column that holds the code/document content  |
| language_column_name | language              | Name of the column that holds programming language information |
| tokenizer            | codeparrot/codeparrot | HuggingFace tokenizer to use                             |
| hf_token             | env-based             | HuggingFace auth token (optional if public)              |

Example

python -m dpk_code_quality.runtime \
  --cq_contents_column_name contents \
  --cq_language_column_name language \
  --cq_tokenizer codeparrot/codeparrot \
  --cq_hf_token <your_hf_token> \
  --data_local_config "{'input_folder': 'test-data/input', 'output_folder': 'output'}"

Notebook example

notebook

Testing

Testing follows the testing strategy of data-processing-lib.

Currently, we have: