Skip to content

OpenEuroLLM/multisynt-evaluations

Repository files navigation

MultiSynt Nemotron-CC Models by Language

Models follow the naming convention MultiSynt/nemotron-cc-{language}-{translation-model}.

Language tower9b tower72b opus include belebele flores others
Basque
Catalan copa_ca
Danish
Dutch
Finnish
French
German
Icelandic icelandic_winogrande
Italian
Norwegian
Polish
Portuguese
Romanian
Spanish
Swedish

Analysis

Unlike HPLT2 monolingual models, no model has intermediate checkpoints available.

Model coverage

  • Full coverage (all 3 models): Finnish, German, Italian, Spanish, Swedish
  • Missing tower72b only: Danish, Dutch, French, Icelandic, Polish, Romanian
  • Missing tower9b + tower72b: Basque, Catalan, Portuguese
  • Missing opus only: Norwegian
  • Norwegian is the only language missing opus while having tower9b.

Benchmark coverage

  • include: Basque, Dutch, Finnish, French, German, Italian, Polish, Portuguese, Spanish (9/15 languages)
  • belebele: Basque, Catalan, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Romanian, Spanish, Swedish (14/15 languages)
  • flores: Danish, Dutch, Finnish, French, German, Italian, Polish, Portuguese, Romanian, Spanish, Swedish (11/15 languages — Catalan, Basque, Icelandic, Norwegian pending flores_plus migration, see flores_migration.md)
  • others: Catalan (copa_ca), Icelandic (icelandic_winogrande)
  • No benchmark coverage: (none)

Steps:

  • evaluate multisynt models with tower72b > tower9B > opus on all languages
  • evaluate hplt models on all languages
  • add missing evaluation noreval etc in oellm-cli

Generate Evaluations

1. Generate the evaluation list

The script generate_eval_list.py produces multisynt_evals.csv, a flat CSV with one row per (model, task) pair, ready to be passed to oellm-cli.

The core of the script is LANGUAGE_TASKS, a dictionary mapping each language to its list of individual evaluation tasks with their n-shot settings and eval suite. The tasks are chosen based on the benchmark coverage table above — for example, languages present in belebele get belebele_{lang}_Latn, languages in flores get the corresponding flores200: tasks, and so on. Languages with no benchmark coverage (Catalan, Icelandic, Norwegian) have an empty task list. For each model, the best available translation model is selected following the priority tower72b > tower9b > opus.

To regenerate the CSV:

uv run generate_eval_list.py

This writes multisynt_evals.csv with columns: model_path, task_path, n_shot, eval_suite.

2. Schedule evaluations on a cluster

Copy the CSV to the cluster (e.g. LUMI):

scp multisynt_evals.csv lumi:

Then on the cluster, schedule all evaluations with oellm-cli:

oellm schedule-eval --eval_csv_path multisynt_evals.csv

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages