Conversation
- fix dependencies
- add structured output to prevent judge from not respecting the prompt
pyproject.toml
Outdated
| [project.optional-dependencies] | ||
| vllm = ["vllm==0.10.2", "transformers>=4.55.2,<5.0.0"] | ||
| # vLLM on PyPI pins transformers<5; optional extra matches that so `uv lock` can resolve. | ||
| vllm = ["vllm>=0.17.0,<1.0.0", "transformers>=4.56.0,<5.0.0"] |
vllm>=0.17.0,<1.0.0 is a very wide range. A few concerns:
- Was this tested with a prebuilt wheel or built from source? Building vLLM from source on cluster nodes often fails due to CUDA kernel compilation issues.
- Is the `StructuredOutputsParams` import path (`vllm.sampling_params`) stable across this entire range? It may have been introduced in 0.17 and could move. For example, `StructuredOutputsParams` was a bit different in `vllm==0.11.0`. I therefore think a narrower, more stable version range makes more sense.
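One way to make the import-path concern checkable is a small guard that probes the installed package at startup instead of failing deep inside a run. This is only a sketch: the helper name `load_structured_outputs_params` is hypothetical, and the code makes no claim about which vLLM releases expose the class; it reports only what it finds at runtime.

```python
import importlib


def load_structured_outputs_params(module_name: str = "vllm.sampling_params"):
    """Return the StructuredOutputsParams class, or None if it cannot be found.

    Hypothetical helper: it only probes the installed package and assumes
    nothing about which vLLM versions provide the class.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # vLLM is not installed, or the module does not exist in this release.
        return None
    # The module exists; the class may still be missing or renamed.
    return getattr(module, "StructuredOutputsParams", None)


params_cls = load_structured_outputs_params()
if params_cls is None:
    print("StructuredOutputsParams not found; check the installed vLLM version")
```

Running this check in CI against each end of the pinned range would turn the stability question into a test result rather than a guess.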
Good point. I tightened the range; 0.18.1 was working. I think `StructuredOutputsParams` is stable across the new range.
judgearena/evaluate.py
Outdated
| _PAIR_SCORE_MAX = 10 | ||
| def build_pair_score_output_choices() -> list[str]: |
The Cartesian product approach works for a single A-vs-B pair (11×11 = 121 choices), but won't scale to multi-criteria evaluation: with N criteria it becomes 11^(2N) choices, which is unusable.
Maybe we could switch to a JSON schema constraint instead of choice, e.g. {"score_A": int, "score_B": int} per criterion. vLLM's StructuredOutputsParams already supports json_schema alongside choice, so this would be a drop-in change.
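To make the scaling argument concrete, here is a minimal sketch that counts the enumerated choices and builds an equivalent JSON schema. It uses only the standard library and does not call vLLM. The function names, the criterion names, and the lower bound of 0 (inferred from 11 values with a maximum of 10) are assumptions for illustration, not the PR's actual code.

```python
import json

PAIR_SCORE_MIN = 0   # assumed: 11 values per score implies a 0..10 scale
PAIR_SCORE_MAX = 10  # mirrors _PAIR_SCORE_MAX in the diff


def count_choice_strings(num_criteria: int) -> int:
    """Number of strings needed if every score combination is its own choice."""
    values_per_score = PAIR_SCORE_MAX - PAIR_SCORE_MIN + 1  # 11
    # Two scores (A and B) per criterion.
    return values_per_score ** (2 * num_criteria)


def build_pair_score_schema(criteria: list[str]) -> dict:
    """Build a JSON schema with one {score_A, score_B} object per criterion."""
    score = {"type": "integer", "minimum": PAIR_SCORE_MIN, "maximum": PAIR_SCORE_MAX}
    pair = {
        "type": "object",
        "properties": {"score_A": score, "score_B": score},
        "required": ["score_A", "score_B"],
        "additionalProperties": False,
    }
    return {
        "type": "object",
        "properties": {name: pair for name in criteria},
        "required": list(criteria),
        "additionalProperties": False,
    }


print(count_choice_strings(1), count_choice_strings(3))  # 121 1771561
# Hypothetical criterion names, for illustration only.
print(json.dumps(build_pair_score_schema(["helpfulness", "accuracy"]), indent=2))
```

The schema grows linearly with the number of criteria, while the choice list grows exponentially, which is the core of the argument above.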
Agreed, updated
| ) | ||
| ) | ||
| if truncated_completion_count: | ||
| print( |
Flagging for a follow-up PR: the codebase mixes print() for warnings, progress, and debug info, making it hard to filter by severity or redirect output. We should migrate to Python's logging module (or at minimum a thin wrapper like logger = logging.getLogger(__name__)). What do you think, @geoalgo?
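As a rough sketch of what that migration could look like for the truncation warning in this diff: the function name `report_truncated`, the `total` parameter, and the message wording are hypothetical, since the original print() body is not shown here.

```python
import logging

# Module-level logger; callers control verbosity and destination via handlers.
logger = logging.getLogger(__name__)


def report_truncated(truncated_completion_count: int, total: int) -> None:
    """Hypothetical stand-in for the print() call in the diff above."""
    if truncated_completion_count:
        # Lazy %-style arguments: the message is only formatted if emitted.
        logger.warning(
            "%d of %d completions were truncated",
            truncated_completion_count,
            total,
        )


if __name__ == "__main__":
    # Configure output once, at the entry point, not inside library code.
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s %(name)s: %(message)s",
    )
    report_truncated(3, 100)
```

With this in place, warnings can be filtered by level or redirected to a file without touching the call sites.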
ab3db1b to ef1c92c
- Switch from choice-based structured outputs to JSON schema constraint
- Tighten vllm version range from >=0.17.0,<1.0.0 to >=0.17.0,<0.19.0