A large-scale 7B pretrained language model developed by BaiChuan-Inc.
A series of large language models developed by Baichuan Intelligent Technology
A 13B large language model developed by Baichuan Intelligent Technology
A Contamination-free Multi-task Language Understanding Benchmark [Official, ACL 2025]
[NeurIPS 2023 Spotlight] In-Context Impersonation Reveals Large Language Models' Strengths and Biases
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
Benchmark suite for open-source language models on the edge. Evaluates inference efficiency, MMLU accuracy, and LLM-rated teaching effectiveness.
CLI tool to evaluate LLM factuality on MMLU benchmark.
Code and data accompanying the article "The impact of quantising a small open source LLM". This repository explores how quantisation affects performance, VRAM usage, and inference speed in Qwen3 1.7B.
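As a rough sketch of the kind of experiment that repository describes, the snippet below loads a small model in 4-bit precision with Hugging Face `transformers` and `bitsandbytes` and reports VRAM use; the model ID `Qwen/Qwen3-1.7B` and the prompt are illustrative assumptions, not code taken from the repository.

```python
# Minimal sketch: load a small model in 4-bit and inspect VRAM use.
# Assumes transformers, torch, and bitsandbytes are installed and a CUDA GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-1.7B"  # assumed model ID for illustration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantisation via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

inputs = tokenizer("Explain quantisation in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```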
Enterprise-grade LLM evaluation framework | Multi-model benchmarking, honest dashboards, system profiling | Academic metrics: MMLU, TruthfulQA, HellaSwag | Zero fake data | PyPI: llm-benchmark-toolkit | Blog: https://dev.to/nahuelgiudizi/building-an-honest-llm-evaluation-framework-from-fake-metrics-to-real-benchmarks-2b90
An easy-to-use and standardised framework for evaluating Large Language Models (LLMs) on the Massive Multitask Language Understanding (MMLU) dataset. Currently supported: Hugging Face transformer models and Bedrock models.
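For context, one common way such frameworks score MMLU with a Hugging Face causal model is to compare the next-token logits of the four answer letters; the sketch below illustrates that idea with a placeholder model and question, and is not the framework's actual API.

```python
# Minimal sketch of MMLU-style multiple-choice scoring with a causal LM.
# The model ID and example question are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

question = "What is the time complexity of binary search?"
choices = ["O(n)", "O(log n)", "O(n log n)", "O(1)"]
letters = ["A", "B", "C", "D"]

prompt = question + "\n" + "\n".join(
    f"{l}. {c}" for l, c in zip(letters, choices)
) + "\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Score each answer letter (leading space matches GPT-2's BPE tokenisation)
letter_ids = [tokenizer.encode(f" {l}")[0] for l in letters]
pred = letters[int(torch.stack([logits[i] for i in letter_ids]).argmax())]
print(f"Predicted answer: {pred}")
```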
Modular Python framework for reproducible LLM benchmarking and analysis, with API-based execution, structured response collection, automatic metrics (BLEU, ROUGE, BERTScore, MMLU, HellaSwag), rankings, and consolidated reports.
Analysis of LLM performance on CPU and GPU, measuring execution time and energy usage.
A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K).
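By way of background, a local Ollama server exposes an HTTP API that tools like this can build on; the sketch below shows plain Ollama API usage rather than the tool's own interface, and the model name `llama3` is an assumption.

```python
# Minimal sketch: query a local Ollama server over its HTTP API.
# Assumes Ollama is running on its default port with a model such as "llama3" pulled.
import requests

def ask_ollama(prompt: str, model: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask_ollama("What is 2 + 2? Answer with just the number."))
```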
Dataset management library for ML experiments—loaders for SciFact, FEVER, GSM8K, HumanEval, MMLU, TruthfulQA, HellaSwag; git-like versioning with lineage tracking; transformation pipelines; quality validation with schema checks and duplicate detection; GenStage streaming for large datasets. Built for reproducible AI research.
Open Source ML Model Comparison -- compare 30+ models side-by-side with real MMLU benchmark data. Filter by parameters, license, and use case. Free browser-based machine learning model database, no sign-up.
Dataset management and caching for AI research benchmarks