Models follow the naming convention MultiSynt/nemotron-cc-{language}-{translation-model}.
| Language | tower9b | tower72b | opus | include | belebele | flores | others |
|---|---|---|---|---|---|---|---|
| Basque | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | |
| Catalan | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | copa_ca |
| Danish | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | |
| Dutch | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | |
| Finnish | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| French | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | |
| German | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Icelandic | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | icelandic_winogrande |
| Italian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Norwegian | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | |
| Polish | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | |
| Portuguese | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | |
| Romanian | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | |
| Spanish | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Swedish | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
Unlike HPLT2 monolingual models, no model has intermediate checkpoints available.
- Full coverage (all 3 models): Finnish, German, Italian, Spanish, Swedish
- Missing tower72b only: Danish, Dutch, French, Icelandic, Polish, Romanian
- Missing tower9b + tower72b: Basque, Catalan, Portuguese
- Missing opus only: Norwegian
- Norwegian is the only language missing opus while having tower9b.
- include: Basque, Dutch, Finnish, French, German, Italian, Polish, Portuguese, Spanish (9/15 languages)
- belebele: Basque, Catalan, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Romanian, Spanish, Swedish (14/15 languages)
- flores: Danish, Dutch, Finnish, French, German, Italian, Polish, Portuguese, Romanian, Spanish, Swedish (11/15 languages — Catalan, Basque, Icelandic, Norwegian pending flores_plus migration, see
flores_migration.md) - others: Catalan (
copa_ca), Icelandic (icelandic_winogrande) - No benchmark coverage: (none)
Steps:
- evaluate multisynt models with tower72b > tower9B > opus on all languages
- evaluate hplt models on all languages
- add missing evaluation noreval etc in oellm-cli
The script generate_eval_list.py produces multisynt_evals.csv, a flat CSV with one row per (model, task) pair, ready to be passed to oellm-cli.
The core of the script is LANGUAGE_TASKS, a dictionary mapping each language to its list of individual evaluation tasks with their n-shot settings and eval suite. The tasks are chosen based on the benchmark coverage table above — for example, languages present in belebele get belebele_{lang}_Latn, languages in flores get the corresponding flores200: tasks, and so on. Languages with no benchmark coverage (Catalan, Icelandic, Norwegian) have an empty task list. For each model, the best available translation model is selected following the priority tower72b > tower9b > opus.
To regenerate the CSV:
uv run generate_eval_list.pyThis writes multisynt_evals.csv with columns: model_path, task_path, n_shot, eval_suite.
Copy the CSV to the cluster (e.g. LUMI):
scp multisynt_evals.csv lumi:Then on the cluster, schedule all evaluations with oellm-cli:
oellm schedule-eval --eval_csv_path multisynt_evals.csv