Commit dc940c4

docs: add docs for mmd operator (#173)
1 parent d584b30 commit dc940c4

2 files changed

Lines changed: 278 additions & 0 deletions

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
---
title: MMDDatasetEvaluator
createTime: 2025/04/04 19:46
permalink: /en/api/operators/text_sft/eval/mmddatasetevaluator/
---

## 📘 Overview

`MMDDatasetEvaluator` is an operator that measures the distribution discrepancy between two datasets using Maximum Mean Discrepancy (MMD). It embeds each text sample into a high-dimensional vector space and computes a kernel-based distance that quantifies the distribution shift between the evaluation dataset and a reference dataset. A smaller MMD score indicates that the two distributions are closer.

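To make the idea concrete, here is a minimal NumPy sketch of squared MMD with an RBF kernel. It is not the operator's actual implementation: the kernel parameterization `exp(-||x - y||^2 / (2 * sigma^2))`, the helper names `rbf_kernel` and `mmd2`, and the reading of `bias` as "biased vs. unbiased estimator" are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Pairwise RBF kernel exp(-||x - y||^2 / (2 * sigma^2)) (assumed form)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point error.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0, bias: bool = True) -> float:
    """Squared MMD between samples x of shape (m, d) and y of shape (n, d)."""
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    if bias:
        # Biased (V-statistic) estimator: diagonal terms are included.
        return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())
    # Unbiased (U-statistic) estimator: drop the diagonal of the within-set terms.
    xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(xx + yy - 2.0 * k_xy.mean())


# Synthetic "embeddings": two samples from one distribution, one shifted sample.
rng = np.random.default_rng(0)
same_a = rng.normal(0.0, 1.0, size=(200, 8))
same_b = rng.normal(0.0, 1.0, size=(200, 8))
shifted = rng.normal(1.0, 1.0, size=(200, 8))

print("same distribution:   ", mmd2(same_a, same_b, sigma=2.0))
print("shifted distribution:", mmd2(same_a, shifted, sigma=2.0))
```

The shifted pair should score noticeably higher than the same-distribution pair, which is the behaviour the operator relies on when it reports that a smaller score means closer distributions.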
## `__init__`

```python
def __init__(
    self,
    ref_frame: DataFlowStorage,
    *,
    ref_max_sample_num: int = 5000,
    ref_shuffle_seed: int = 42,
    ref_instruction_key: str = "input",
    ref_output_key: str = "output",
    kernel_type: Literal["RBF"] = "RBF",
    bias: bool = True,
    rbf_sigma: float = 1.0,
    embedding_type: Literal["vllm", "sentence_transformers"] = "sentence_transformers",
    embedding_model_name: str | None = None,
    st_device: str = "cuda",
    st_batch_size: int = 32,
    st_normalize_embeddings: bool = True,
    vllm_max_num_seqs: int = 128,
    vllm_gpu_memory_utilization: float = 0.9,
    vllm_tensor_parallel_size: int = 1,
    vllm_pipeline_parallel_size: int = 1,
    vllm_truncate_max_length: int = 40960,
    cache_type: Literal["redis", "none"] = "none",
    redis_url: str = "redis://127.0.0.1:6379",
    max_concurrent_requests: int = 50,
    redis_db: int = 0,
    cache_model_id: str | None = None,
)
```

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **ref_frame** | DataFlowStorage | Required | The reference dataset used as the distribution baseline. |
| **ref_max_sample_num** | int | `5000` | Maximum number of samples to draw from the reference dataset. |
| **ref_shuffle_seed** | int | `42` | Random seed for sampling the reference dataset. |
| **ref_instruction_key** | str | `'input'` | Column name of the instruction field in the reference dataset. |
| **ref_output_key** | str | `'output'` | Column name of the output field in the reference dataset. |
| **kernel_type** | str | `'RBF'` | Kernel function type; currently only `'RBF'` is supported. |
| **bias** | bool | `True` | Whether to use bias in the MMD computation. |
| **rbf_sigma** | float | `1.0` | Bandwidth parameter of the RBF kernel. |
| **embedding_type** | str | `'sentence_transformers'` | Embedding backend to use; either `'sentence_transformers'` or `'vllm'`. **Note:** to use `'vllm'`, install `distflow[vllm]` first. |
| **embedding_model_name** | str or None | `None` | Name of the embedding model. The signature defaults to `None`, but a value must be provided. |
| **st_device** | str | `'cuda'` | Device for SentenceTransformers (e.g., `'cuda'`, `'cpu'`). |
| **st_batch_size** | int | `32` | Batch size for SentenceTransformers inference. |
| **st_normalize_embeddings** | bool | `True` | Whether to normalize the embeddings produced by SentenceTransformers. |
| **vllm_max_num_seqs** | int | `128` | Maximum number of sequences for vLLM. |
| **vllm_gpu_memory_utilization** | float | `0.9` | GPU memory utilization ratio for vLLM. |
| **vllm_tensor_parallel_size** | int | `1` | Tensor parallel size for vLLM. |
| **vllm_pipeline_parallel_size** | int | `1` | Pipeline parallel size for vLLM. |
| **vllm_truncate_max_length** | int | `40960` | Maximum truncation length for vLLM inputs. |
| **cache_type** | str | `'none'` | Cache type for embeddings; either `'redis'` or `'none'`. |
| **redis_url** | str | `'redis://127.0.0.1:6379'` | Redis connection URL, used when `cache_type='redis'`. |
| **max_concurrent_requests** | int | `50` | Maximum number of concurrent requests to Redis. |
| **redis_db** | int | `0` | Redis database index. |
| **cache_model_id** | str or None | `None` | Model identifier used in the Redis cache key. |

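The interaction between `rbf_sigma` and `st_normalize_embeddings` can be illustrated with a short sketch. When embeddings are unit-normalized, the squared distance between two vectors is `2 - 2 * cos`, so it always lies in `[0, 4]` and the bandwidth decides how sharply the kernel separates similar from dissimilar texts. The kernel form `exp(-d^2 / (2 * sigma^2))` and the helper name `rbf_from_cosine` are assumptions for illustration; the operator's exact parameterization is not stated here.

```python
import math


def rbf_from_cosine(cos_sim: float, sigma: float = 1.0) -> float:
    """RBF kernel value for two unit-norm vectors with the given cosine similarity."""
    # For unit vectors, ||x - y||^2 = 2 - 2 * cos(x, y).
    sq_dist = 2.0 - 2.0 * cos_sim
    return math.exp(-sq_dist / (2.0 * sigma**2))


# Compare kernel values for identical, related, and orthogonal embeddings.
for sigma in (0.25, 1.0, 4.0):
    values = [round(rbf_from_cosine(c, sigma), 4) for c in (1.0, 0.5, 0.0)]
    print(f"sigma={sigma}: cos=1.0, 0.5, 0.0 -> {values}")
```

Under these assumptions, a very small bandwidth makes almost every pair look dissimilar and a very large one makes almost every pair look identical, so both extremes flatten the MMD score.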
## `run`

```python
def run(
    self,
    storage: DataFlowStorage,
    input_instruction_key: str,
    input_output_key: str,
    max_sample_num: int | None = None,
    shuffle_seed: int | None = None,
) -> tuple[float, dict[str, Any]]
```

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **storage** | DataFlowStorage | Required | The DataFlowStorage instance containing the evaluation dataset. |
| **input_instruction_key** | str | Required | Column name of the instruction field in the evaluation dataset. |
| **input_output_key** | str | Required | Column name of the output field in the evaluation dataset. |
| **max_sample_num** | int or None | `None` | Maximum number of samples to draw from the evaluation dataset; falls back to `ref_max_sample_num` if not set. |
| **shuffle_seed** | int or None | `None` | Random seed for sampling the evaluation dataset; falls back to `ref_shuffle_seed` if not set. |

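The fallback rule for `max_sample_num` and `shuffle_seed` can be sketched as follows. This is an illustration only: the helper `sample_rows` is hypothetical, and the "shuffle with a seeded generator, then keep the first N rows" strategy is an assumption about how the sampling might work, not a description of the operator's code.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(
    rows: Sequence[T],
    max_sample_num: Optional[int],
    shuffle_seed: Optional[int],
    *,
    ref_max_sample_num: int = 5000,
    ref_shuffle_seed: int = 42,
) -> list[T]:
    """Hypothetical helper: sample evaluation rows with reference-side fallbacks."""
    # Fall back to the reference-side settings when the run-time values are unset.
    limit = ref_max_sample_num if max_sample_num is None else max_sample_num
    seed = ref_shuffle_seed if shuffle_seed is None else shuffle_seed
    shuffled = list(rows)
    # A dedicated Random instance keeps the sampling reproducible and isolated.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:limit]


rows = list(range(100))
# Leaving both arguments unset behaves like passing the reference defaults.
print(sample_rows(rows, None, None) == sample_rows(rows, 5000, 42))
print(len(sample_rows(rows, 10, None)))
```

Sharing the seed and sample cap between the two datasets by default keeps the comparison symmetric unless the caller overrides it explicitly.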
## 🧠 Example Usage

```python
from dataflow.operators.text_sft.eval import MMDDatasetEvaluator
from dataflow.utils.storage import FileStorage

# Prepare reference and evaluation storages
ref_storage = FileStorage(first_entry_file_name="reference_data.jsonl")
eval_storage = FileStorage(first_entry_file_name="eval_data.jsonl")

# Initialize the evaluator
evaluator = MMDDatasetEvaluator(
    ref_frame=ref_storage.step(),
    ref_instruction_key="instruction",
    ref_output_key="output",
    embedding_type="sentence_transformers",
    embedding_model_name="BAAI/bge-large-zh",
    st_device="cuda",
    st_batch_size=32,
)

# Run evaluation
mmd_score, mmd_meta = evaluator.run(
    eval_storage.step(),
    input_instruction_key="instruction",
    input_output_key="output",
)
print(f"MMD Score: {mmd_score}, Meta: {mmd_meta}")
```

#### 🧾 Default Output Format

| Field | Type | Description |
| :--- | :--- | :--- |
| **MMDScore** | float | The computed MMD distance (smaller means closer distributions). |
| **MMDMeta** | dict | Metadata dictionary containing computation details. |

**Example Output:**
```json
{
  "MMDScore": 0.00342,
  "MMDMeta": {
    "num_src_samples": 5000,
    "num_tgt_samples": 5000
  }
}
```
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: MMDDatasetEvaluator
createTime: 2025/04/04 19:46
permalink: /zh/api/operators/text_sft/eval/mmddatasetevaluator/
---

## 📘 Overview

[MMDDatasetEvaluator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/text_sft/eval/mmd_dataset_evaluator.py) is a dataset evaluation operator based on Maximum Mean Discrepancy (MMD). It embeds text into a high-dimensional space and computes a kernel-based distance to quantify the degree of distribution shift between the evaluation dataset and the reference dataset. A smaller MMD value indicates that the distributions of the two datasets are closer.

## `__init__`

```python
def __init__(
    self,
    ref_frame: DataFlowStorage,
    *,
    ref_max_sample_num: int = 5000,
    ref_shuffle_seed: int = 42,
    ref_instruction_key: str = "input",
    ref_output_key: str = "output",
    kernel_type: Literal["RBF"] = "RBF",
    bias: bool = True,
    rbf_sigma: float = 1.0,
    embedding_type: Literal["vllm", "sentence_transformers"] = "sentence_transformers",
    embedding_model_name: str | None = None,
    st_device: str = "cuda",
    st_batch_size: int = 32,
    st_normalize_embeddings: bool = True,
    vllm_max_num_seqs: int = 128,
    vllm_gpu_memory_utilization: float = 0.9,
    vllm_tensor_parallel_size: int = 1,
    vllm_pipeline_parallel_size: int = 1,
    vllm_truncate_max_length: int = 40960,
    cache_type: Literal["redis", "none"] = "none",
    redis_url: str = "redis://127.0.0.1:6379",
    max_concurrent_requests: int = 50,
    redis_db: int = 0,
    cache_model_id: str | None = None,
)
```

### `__init__` Parameters

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **ref_frame** | DataFlowStorage | Required | The reference dataset (a DataFlowStorage) used as the baseline for the distribution comparison. |
| **ref_max_sample_num** | int | `5000` | Maximum number of samples to draw from the reference dataset. |
| **ref_shuffle_seed** | int | `42` | Random seed for sampling the reference dataset. |
| **ref_instruction_key** | str | `'input'` | Column name of the instruction field in the reference dataset. |
| **ref_output_key** | str | `'output'` | Column name of the output field in the reference dataset. |
| **kernel_type** | str | `'RBF'` | Kernel function type; currently only `'RBF'` is supported. |
| **bias** | bool | `True` | Whether to use the bias term in the MMD computation. |
| **rbf_sigma** | float | `1.0` | Bandwidth parameter of the RBF kernel. |
| **embedding_type** | str | `'sentence_transformers'` | Embedding backend; either `'sentence_transformers'` or `'vllm'`. **Note:** to use `'vllm'`, install `distflow[vllm]` first. |
| **embedding_model_name** | str or None | `None` | Name of the embedding model. The signature defaults to `None`, but a value must be provided. |
| **st_device** | str | `'cuda'` | Device on which SentenceTransformers runs (e.g., `'cuda'`, `'cpu'`). |
| **st_batch_size** | int | `32` | Batch size for SentenceTransformers inference. |
| **st_normalize_embeddings** | bool | `True` | Whether to normalize the embedding vectors produced by SentenceTransformers. |
| **vllm_max_num_seqs** | int | `128` | Maximum number of sequences for vLLM. |
| **vllm_gpu_memory_utilization** | float | `0.9` | GPU memory utilization ratio for vLLM. |
| **vllm_tensor_parallel_size** | int | `1` | Tensor parallel size for vLLM. |
| **vllm_pipeline_parallel_size** | int | `1` | Pipeline parallel size for vLLM. |
| **vllm_truncate_max_length** | int | `40960` | Maximum truncation length for vLLM inputs. |
| **cache_type** | str | `'none'` | Embedding cache type; either `'redis'` or `'none'`. |
| **redis_url** | str | `'redis://127.0.0.1:6379'` | Connection URL used when the Redis cache is enabled. |
| **max_concurrent_requests** | int | `50` | Maximum number of concurrent requests to the Redis cache. |
| **redis_db** | int | `0` | Redis database index. |
| **cache_model_id** | str or None | `None` | Model identifier used in the Redis cache key. |

## `run`

```python
def run(
    self,
    storage: DataFlowStorage,
    input_instruction_key: str,
    input_output_key: str,
    max_sample_num: int | None = None,
    shuffle_seed: int | None = None,
) -> tuple[float, dict[str, Any]]
```

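Runs the operator's main logic: reads samples from the evaluation dataset, computes the MMD distance between them and the reference dataset, and returns the distance value together with metadata.

#### Parameters

| Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **storage** | DataFlowStorage | Required | The DataFlowStorage instance containing the evaluation dataset. |
| **input_instruction_key** | str | Required | Column name of the instruction field in the evaluation dataset. |
| **input_output_key** | str | Required | Column name of the output field in the evaluation dataset. |
| **max_sample_num** | int or None | `None` | Maximum number of samples to draw from the evaluation dataset; defaults to `ref_max_sample_num` if not set. |
| **shuffle_seed** | int or None | `None` | Random seed for sampling the evaluation dataset; defaults to `ref_shuffle_seed` if not set. |
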
## 🧠 Example Usage

```python
from dataflow.operators.text_sft.eval import MMDDatasetEvaluator
from dataflow.utils.storage import FileStorage

# Prepare the reference and evaluation datasets
ref_storage = FileStorage(first_entry_file_name="reference_data.jsonl")
eval_storage = FileStorage(first_entry_file_name="eval_data.jsonl")

# Initialize the evaluator
evaluator = MMDDatasetEvaluator(
    ref_frame=ref_storage.step(),
    ref_instruction_key="instruction",
    ref_output_key="output",
    embedding_type="sentence_transformers",
    embedding_model_name="BAAI/bge-large-zh",
    st_device="cuda",
    st_batch_size=32,
)

# Run the evaluation
mmd_score, mmd_meta = evaluator.run(
    eval_storage.step(),
    input_instruction_key="instruction",
    input_output_key="output",
)
print(f"MMD Score: {mmd_score}, Meta: {mmd_meta}")
```

#### 🧾 Default Output Format

| Field | Type | Description |
| :--- | :--- | :--- |
| **MMDScore** | float | The computed MMD distance value (smaller means the distributions are closer). |
| **MMDMeta** | dict | Metadata dictionary containing details of the computation. |

**Example Output:**
```json
{
  "MMDScore": 0.00342,
  "MMDMeta": {
    "num_src_samples": 5000,
    "num_tgt_samples": 5000
  }
}
```
