
Feature/evaluate corpus#54

Merged
tanhaow merged 11 commits into develop from feature/evaluate-corpus
Mar 19, 2026

Conversation

@tanhaow
Contributor

@tanhaow tanhaow commented Mar 17, 2026

Associated Issue(s): resolves #16

Changes in this PR

Include all key changes in this pull request

  • Added evaluate_corpus.py script that generates CSV files containing MT evaluation metrics (ChrF and COMET scores) for machine translation corpora
  • Implemented caching in metrics.py to avoid reloading models for each translation, reducing evaluation time from ~9 seconds per translation to ~0.35 seconds per translation
  • Suppressed verbose PyTorch Lightning logging messages to provide cleaner output during evaluation
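The caching change can be sketched roughly as follows (a minimal illustration of the pattern described above, not the actual metrics.py code; the names are hypothetical):

```python
# Minimal sketch of the metric-caching pattern (hypothetical names):
# load each metric model once, then reuse it for every translation.
_METRIC_CACHE = {}

def get_metric(name, loader):
    """Return the cached metric for `name`, calling `loader` only on first use."""
    if name not in _METRIC_CACHE:
        # The expensive step (model download/load) happens only once per metric.
        _METRIC_CACHE[name] = loader()
    return _METRIC_CACHE[name]
```

With this shape, something like `get_metric("comet", lambda: evaluate.load("comet"))` would pay the multi-second model load once and hit the cache on every subsequent call.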

Notes

  • Added REQUIRED_FIELDS constant documenting expected input fields in MT corpus records

Reviewer Checklist

  • Run the script on a sample MT corpus to verify CSV output format is correct
  • Verify that ChrF and COMET scores are computed correctly and fall within expected ranges
  • Verify that progress bar and logging output are clean and informative

tanhaow added 5 commits March 11, 2026 08:56
Resolved conflicts by accepting develop's changes:
- Added environment variables to suppress HuggingFace logging
- Wrapped COMET computation in context managers to suppress output
@tanhaow tanhaow requested a review from laurejt March 17, 2026 14:45
@tanhaow tanhaow self-assigned this Mar 17, 2026
Contributor

@laurejt laurejt left a comment


Overall, this looks good and only needs a few small changes. Requested changes:

  • Document why the pytorch- and huggingface-related environment variables are set in metrics.py. Not all of these are related to logging.
  • As a small add-on (since you're already updating metrics.py), please normalize the chrF score so its range is 0-1.

Comment thread src/muse/evaluation/evaluate_corpus.py Outdated
parsed = args.parse_args()

# Setup logging
log_level = logging.DEBUG if parsed.verbose else logging.INFO
Contributor


Typically, default logging should be set to warning not info, especially when there is a verbose mode.

Comment thread src/muse/evaluation/metrics.py Outdated
# Suppress verbose HuggingFace logging
# Suppress verbose HuggingFace and PyTorch Lightning logging
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Contributor


Why do you need to explicitly set this? Were you encountering a warning message?

Contributor Author


There were a lot of INFO messages suggesting LitLogger, which made the output long and messy:

INFO: 💡 Tip: For seamless cloud logging and experiment tracking, try installing [litlogger](https://pypi.org/project/litlogger/) to enable LitLogger, which logs metrics and artifacts automatically to the Lightning Experiments platform.
INFO: 💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.

Contributor


I see this pointing to the bigger issue that the code assumes that INFO is the default non-verbose level, which is not standard.

Contributor Author


You are right. I have set the default logging level to WARNING. Let me also remove some of this logging-related code.

Contributor Author


So, the thing is, the PyTorch Lightning messages are still showing after setting the default logging to WARN. I did some investigation and found that PyTorch Lightning uses its own internal logging system that bypasses Python's logging module... so we actually need to keep the stderr contextlib redirects. But I will simplify them into a minimal version.

Contributor


Thanks for looking into this!

Comment thread src/muse/evaluation/metrics.py Outdated
# Suppress verbose HuggingFace and PyTorch Lightning logging
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
Contributor


Why do you need to set this?

Contributor Author


It's for enabling fallback for unsupported MPS operations. But I can remove this.

Contributor


There's no need to remove this, just document it. It's worth documenting that this library supports MPS / Apple Silicon whereas we have not done that with other parts of the pipeline / package.

Comment thread src/muse/evaluation/metrics.py Outdated
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["PYTHONWARNINGS"] = "ignore"
os.environ["PL_DISABLE_FORK"] = "1"
Contributor


Why do you need to set this?

Contributor Author


It's recommended by HuggingFace to prevent tokenizer deadlocks when used with PyTorch's parallel processing during COMET evaluation. I will add a note in the code
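Taken together, the thread suggests a small, documented configuration block along these lines (a sketch based on the purposes discussed above; the exact comments in metrics.py may differ):

```python
import os

# Logging-related: silence verbose HuggingFace transformers output.
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
# Behavior-related: avoid tokenizer fork deadlocks when HF tokenizers are
# combined with PyTorch parallel processing during COMET evaluation.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Behavior-related: fall back to CPU for operations unsupported on
# Apple Silicon MPS, so evaluation still runs on that hardware.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
```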

Comment thread src/muse/evaluation/metrics.py Outdated
import torch

# Suppress verbose HuggingFace logging
# Suppress verbose HuggingFace and PyTorch Lightning logging
Contributor


Only a few of these environment variables are logging related. All others must be documented since they impact the execution of pytorch & huggingface

Comment thread src/muse/evaluation/evaluate_corpus.py Outdated
Comment on lines +23 to +24
# Required fields in input machine translation corpus records
REQUIRED_FIELDS = ["tr_id", "src_text", "ref_text", "tr_text"]
Contributor


This is fine, but unnecessary. There's no need to validate input we control for experimental code.

Comment on lines +37 to +41
# Cache for loaded metrics to avoid reloading models
LOADED_METRICS = {
"chrf": None,
"comet": None,
}
Contributor


How much RAM does this require? This will be especially important (and worth noting) when CometKiwi is added.

@@ -28,7 +52,11 @@ def compute_chrf(
Returns a float in the range [0, 100], where 0 indicates no match and 100
indicates a perfect match.
"""
Contributor


In retrospect, the range of ChrF here is odd: since it's an f-score, it typically ranges between 0 and 1. This must have something to do with the implementation that huggingface-evaluate uses.

Let's correct this by normalizing the scores to 0-1 (i.e., divide by 100) and update the relevant documentation.
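The requested change amounts to a one-line normalization (a sketch; the function name is illustrative, not from the PR):

```python
def normalize_chrf(raw_score: float) -> float:
    """Convert a chrF score from huggingface-evaluate's 0-100 scale
    to the conventional 0-1 f-score range."""
    return raw_score / 100.0
```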

LOADED_METRICS["comet"] = evaluate.load("comet")

comet_metric = LOADED_METRICS["comet"]
gpus = 1 if (torch.cuda.is_available() or torch.backends.mps.is_available()) else 0
Contributor


This is fine for now, we may want to update this logic if we plan on running this on a cluster (i.e., make it possible to specify the use of gpu or cpu-only regardless of what the device has available).
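That future update could look roughly like this (a hedged sketch, with device availability passed in explicitly so the policy is testable; the function and parameter names are hypothetical):

```python
def pick_gpu_count(cuda_available: bool, mps_available: bool,
                   force_cpu: bool = False) -> int:
    """Return how many GPUs to request: 0 when forced to CPU-only
    or when no accelerator is available, else 1."""
    if force_cpu:
        return 0
    return 1 if (cuda_available or mps_available) else 0
```

In practice the caller would pass `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and a CLI flag (e.g. a hypothetical `--cpu-only`) could set `force_cpu=True` for cluster runs.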

@tanhaow
Contributor Author

tanhaow commented Mar 18, 2026

Thanks for your review, @laurejt ! I've made the following changes:

  • Change default logging level to WARNING
  • Simplify and document environment variable configuration
  • Normalize ChrF scores to 0-1 range (good catch!)
  • Document RAM requirements for COMET caching
  • Remove REQUIRED_FIELDS constant

@tanhaow tanhaow requested a review from laurejt March 18, 2026 14:15
Contributor

@laurejt laurejt left a comment


When I run evaluate_corpus.py, none of the pytorch lightning INFO logging messages are suppressed.

@tanhaow tanhaow requested a review from laurejt March 19, 2026 14:38
@tanhaow
Contributor Author

tanhaow commented Mar 19, 2026

PyTorch Lightning logging is now suppressed. Now only 1 INFO message appears on first evaluation (checkpoint upgrade notice) instead of 6 messages per translation.

Contributor

@laurejt laurejt left a comment


Pytorch lightning does use Python logging, but I think this is a case of not being specific enough with the logger name. I was able to suppress the logging messages with the following two lines:

logging.getLogger("pytorch_lightning.utilities.rank_zero").setLevel(logging.WARNING)
logging.getLogger("pytorch_lightning.utilities.migration").setLevel(logging.WARNING)

Try simplifying the code to this and verify that this works for your local dev environment as well.

In case you're curious, I figured out the particular module names by searching for the portions of the INFO logs within the pytorch-lightning repo.

Comment thread src/muse/evaluation/metrics.py Outdated
Comment on lines +95 to +102
# Suppress PyTorch Lightning INFO messages by redirecting stderr at file descriptor level
# This is necessary because PyTorch Lightning bypasses Python's logging system
_stderr_fd = sys.stderr.fileno()
_original_stderr_fd = os.dup(_stderr_fd)
_devnull_fd = os.open(os.devnull, os.O_WRONLY)
os.dup2(_devnull_fd, _stderr_fd)

try:
Contributor


This is a lot of complexity for what we're trying to do, see if my suggestion works instead.

Comment thread src/muse/evaluation/metrics.py Outdated
Comment on lines +109 to +113
finally:
# Restore original stderr
os.dup2(_original_stderr_fd, _stderr_fd)
os.close(_original_stderr_fd)
os.close(_devnull_fd)
Contributor


This is a lot of complexity for what we're trying to do, see if my suggestion works instead.

@tanhaow
Contributor Author

tanhaow commented Mar 19, 2026

Pytorch lightning does use Python logging, but I think this is a case of not being specific enough with the logger name. I was able to suppress the logging messages with the following two lines:

logging.getLogger("pytorch_lightning.utilities.rank_zero").setLevel(logging.WARNING)
logging.getLogger("pytorch_lightning.utilities.migration").setLevel(logging.WARNING)

Try simplifying the code to this and verify that this works for your local dev environment as well.

In case you're curious, I figured out the particular module names by searching for the portions of the INFO logs within the pytorch-lightning repo.

This does look like a much more elegant solution!! Ugh you're so smart

@tanhaow tanhaow requested a review from laurejt March 19, 2026 14:56
@tanhaow
Contributor Author

tanhaow commented Mar 19, 2026

@laurejt It works perfectly. Thank you for digging out this solution!!

Contributor

@laurejt laurejt left a comment


🚀

@tanhaow tanhaow merged commit 80c74b7 into develop Mar 19, 2026
1 check passed
@tanhaow tanhaow deleted the feature/evaluate-corpus branch March 19, 2026 15:07