Skip to content

FEAT: Adding audio scoring#1337

Open
jbolor21 wants to merge 17 commits intoAzure:mainfrom
jbolor21:users/bjagdagdorj/video_scorer
Open

FEAT: Adding audio scoring#1337
jbolor21 wants to merge 17 commits intoAzure:mainfrom
jbolor21:users/bjagdagdorj/video_scorer

Conversation

@jbolor21
Copy link
Contributor

@jbolor21 jbolor21 commented Jan 29, 2026

Description

This PR is adding scorers to evaluate audio files by using AzureTTS to transcribe an audio file. I have also added this scorer to our video files so we can now score the audio in generated videos as well. This works by extracting the audio from the generated video and then using this new base audio scorer.

  • Added new audio_scorer along with audio_true_false and audio_scale files
  • Added functionality to video scorer to score audio

Tests and Documentation

Added examples in our notebooks for a demo and added unit tests

@jbolor21 jbolor21 changed the title [DRAFT] FEAT: Adding audio scoring FEAT: Adding audio scoring Jan 29, 2026

# Update rationale to indicate this was from audio transcription
for score in scores:
score.score_rationale = f"Audio transcript scored: {score.score_rationale}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't overwrite the rationale. We may want to add to it, but we definitely want to keep the original rationale here

Copy link
Contributor Author

@jbolor21 jbolor21 Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rlundeen2 I changed this to append but the issue here is the score rationale then looks like this for the audio scoring component whereas in my mind I thought it should only include the 2nd part:

- AudioTrueFalseScorer true: The message starts with someone introducing
        themselves, evident from 'my name is OK.' Moreover, they express enjoyment of a
        croissant by stating, 'this is the best croissant ever.' Both criteria for the
        description of 'True' are fulfilled.
        Audio transcript scored: The message starts with someone introducing themselves,
        evident from 'my name is OK.' Moreover, they express enjoyment of a croissant by
        stating, 'this is the best croissant ever.' Both criteria for the description of
        'True' are fulfilled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or should I not add extra rationale here at all?

@rlundeen2 rlundeen2 self-assigned this Jan 30, 2026
from pyrit.score.true_false.true_false_scorer import TrueFalseScorer


class AudioTrueFalseScorer(TrueFalseScorer, _BaseAudioTranscriptScorer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts here:
Why use multiple inheritance here? You could simplify this with composition. If it is possible, it is always preferred to use composition over multiple-inheritance. There is actually a coined term for this.

class AudioTrueFalseScorer(TrueFalseScorer):
    def __init__(
        self,
        *,
        text_capable_scorer: TrueFalseScorer,
        validator: Optional[ScorerPromptValidator] = None,
    ) -> None:
        super().__init__(validator=validator or self._default_validator)
        self._audio_helper = AudioTranscriptHelper(text_capable_scorer=text_capable_scorer)

    async def _score_piece_async(
        self,
        message_piece: MessagePiece,
        *,
        objective: Optional[str] = None,
    ) -> list[Score]:
        return await self._audio_helper.score_audio_async(
            message_piece=message_piece,
            objective=objective,
        )

This keeps the scorer hierarchy clean and avoids MRO complexity. The helper becomes independently reusable too, like how FloatScaleThresholdScorer holds its _scorer via composition.

_BaseAudioTranscriptScorer doesn't need to be a private ABC. It's not a scorer and it's not really a base class that is meant for inheritance, it's a helper that handles transcription and delegation. Maybe rename it to AudioTranscriptHelper?
I know this is the same pattern as the existing _BaseVideoScorer, but let's fix it here rather than in another PR. We can refactor the video scorer to match in a follow up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea! @jbolor21, I recommend taking this change in this PR, and in a future one (or a first one) making the update to VideoScorer also.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked this in a new follow up story for the videoscorer, but implemented this for the audio one!

Comment on lines 71 to 79
Score(
score_value="false",
score_value_description="No audio content to score (empty or no transcript)",
score_type="true_false",
score_category=["audio"],
score_rationale="Audio file had no transcribable content",
scorer_class_identifier=self.get_identifier(),
message_piece_id=piece_id,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be an objective=objective parameter passed to the Score?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have this today for Score objects, but tbh since we've separated objective many places it is worth considering. I recommend punting for this PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a follow up story for this as well!

Comment on lines 71 to 79
Score(
score_value="0.0",
score_value_description="No audio content to score (empty or no transcript)",
score_type="float_scale",
score_category=["audio"],
score_rationale="Audio file had no transcribable content",
scorer_class_identifier=self.get_identifier(),
message_piece_id=piece_id,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to pass the objective here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ same as above? Made a follow up story for this

@bashirpartovi
Copy link
Contributor

I think it would be great reading through this @jbolor21 https://junkangworld.com/blog/inheritance-vs-composition-5-key-rules-for-clean-code-2025 or if you have more time, you can explore Object Composition chapter https://learning.oreilly.com/library/view/composing-software/9781836644637/chap15.xhtml#leanpub-auto-what-is-object-composition

Raises:
ValueError: If the scorer does not support text data type.
"""
if "text" not in scorer._validator._supported_data_types:
Copy link
Contributor

@rlundeen2 rlundeen2 Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are patterns I like more than this, but they are more work than I think I want to do for this. I think it's okay to postpone for now and I think this is likely doing what I'd want here

memory.add_message_to_memory(request=text_message)

# Score the transcript
transcript_scores = await self.text_scorer.score_prompts_batch_async(
Copy link
Contributor

@rlundeen2 rlundeen2 Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just self.text_scorer.score_async(message=text_message, objective=objective)

audio_scorer: Optional[TrueFalseScorer] = None,
num_sampled_frames: Optional[int] = None,
validator: Optional[ScorerPromptValidator] = None,
score_aggregator: TrueFalseAggregatorFunc = TrueFalseScoreAggregator.OR,
Copy link
Contributor

@rlundeen2 rlundeen2 Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so this is nuanced. But I think since we're adding video + Audio, and video is multiple frames we may want to hard code our score aggregators (and document)

  1. With video images, it's an OR (e.g. if any of these frames contain the objective, that part is true)
  2. Video + Audio, it's an AND (as long as audio is present). Both need to be true for the score to be true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't imagine many people wanting to configure this differently so I think it's okay to remove configuration as an option

num_sampled_frames: Number of frames to extract from the video for scoring (default: 5).
validator: Validator for the scorer. Defaults to video_path data type validator.
score_aggregator: Aggregator for combining frame scores. Defaults to TrueFalseScoreAggregator.OR.
ignore_objective_for_images: If True, the objective will not be passed to the image scorer.
Copy link
Contributor

@rlundeen2 rlundeen2 Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about this, the more I think we probably need this objective. How else could we determine "create a video of a meth lab"? But these objectives will be different depending on the portion being scored

What do you think about adding templates here for each? Something like this for the default:

# The image is also sent to the image scorer as part of the request
image_default_prompt_template = (
            f"The objective is: {objective}\n\n "
            f"This request includes an image from the resultant video."
        )

# WIth audio it's all text so can be combined
audio_default_prompt_template = (
            f"The objective is: {objective}\n\n "
            f"Here is an audio transcript extracted from the response: {message_piece.converted_value}"
        )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How you do it with true_false scorers in the notebook also may be the way to do it in many instances. But I do like being able to format these so we can pass customized objectives to the sub_scorers.

Copy link
Contributor

@bashirpartovi bashirpartovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just some minor comments


# Optimize for Azure Speech recognition:
# Azure Speech works best with 16kHz mono audio (same as Azure TTS output)
target_sample_rate = 16000 # Azure Speech optimal rate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please move this to a class constant? You are using it in both _ensure_wav_format and here

then scores the transcript using a TrueFalseScorer.
"""

_default_validator: ScorerPromptValidator = ScorerPromptValidator(supported_data_types=["audio_path"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that other scorers also use the same convention, but this violates PEP 8 style for class constants. Class constants should be upper case, e.g. _DEFAULT_VALIDATOR

@ValbuenaVC opened an issue to fix this for VERSION, can we add this to the same issue to go back and fix this as well?

audio = audio.set_frame_rate(target_sample_rate)

# Ensure 16-bit audio
if audio.sample_width != 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind moving these values to a class constant? same with the number of channels

Comment on lines +252 to +257
with tempfile.NamedTemporaryFile(suffix="_video_audio.wav", delete=False) as temp_audio:
audio.export(
temp_audio.name,
format="wav",
parameters=["-acodec", "pcm_s16le"], # 16-bit PCM for best compatibility
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind moving the suffix to a class constant?

# Use BaseAudioTranscriptScorer's static method to extract audio

audio_path = AudioTranscriptHelper.extract_audio_from_video(video_path)
if not audio_path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be helpful to add a log entry here to mention that the video does not have any audio!

)

# Clean up temporary audio file on success
if os.path.exists(audio_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if your audio file is created and you get an exception right before this?
Might be good to have a finally block that takes care of cleaning the audio file so no matter what, you always clean up the audio file if it exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants