Skip to content

Add file_extension, enable_legacy_filename fields to BlobType#3

Merged
ddl-rliu merged 1 commit into1.16.4-dominofrom
rliu.file-ext
Mar 26, 2026
Merged

Add file_extension, enable_legacy_filename fields to BlobType#3
ddl-rliu merged 1 commit into1.16.4-dominofrom
rliu.file-ext

Conversation

@ddl-rliu
Copy link
Copy Markdown

Upstream flyteorg#7009

Tracking issue

Closes flyteorg#7024 [BUG] [copilot] File extensions are missing when copilot downloads Blob/FlyteFile inputs

Why are the changes needed?

We'll assume the above behavior is not a bug (it is long-standing, and a "bugfix" will likely break existing workflows). Instead, this PR proposes enhancements to FlyteFile/Blob to support writing workflow inputs with the file extension. This PR includes several changes across flyteidl and flytecopilot.

The enhancements allow workflows to be flexible when writing blobs. Specifically, the existing behavior where flytecopilot writes blobs during the download phase without the file extension (e.g. "inputs/data") can now be enhanced so that the file extension is included (e.g. "inputs/data.csv").

What changes were proposed in this pull request?

[flyteidl]

Add a new file_extension string field to the BlobType protobuf message, allowing FlyteFile to optionally specify a file extension (e.g. "csv") that flytecopilot appends when writing blobs during the download phase (e.g. "data.csv"). When empty (the default), behavior is unchanged.

Add a new enable_legacy_filename bool field to the BlobType protobuf message, allowing FlyteFile to optionally specify whether to preserve backward compatibility for tasks that read from the extensionless path.

Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger).

[flytecopilot]

The copilot download phase infers the desired download behavior(s) from the input interface.

[flytekit] PR: flyteorg/flytekit#3406

Alternatives considered

The PR's approach configures the file download behaviors at the BlobType level (e.g. per FlyteFile). This has several pros (granularity, explicitness).

But, one con is that unlike BlobType.format (which can be inferred from the output filename e.g. "data.csv" -> format: "csv"), the new fields BlobType.file_extension, BlobType.enable_legacy_filename could not be inferred from the output filename (does "data.csv" match to file_extension: "csv"?). Ultimately, this seems like it is introducing a minor inconsistency, but acceptable given the benefits. (It also seems like this is all hypothetical - copilot upload does not actually infer the format from the filename, so this concern may be somewhat moot?)

Here are the other approaches I considered:

1. New flyte-copilot CLI flags --file_extension_config ENUM=(disabled, enabled, legacy)

file_extension_config = disabled (by default) - Same behavior as today (data: FlyteFile[csv] is written to outputs/data)
file_extension_config = enabled - New behavior as today (data: FlyteFile[csv] is written to outputs/data.csv)
file_extension_config = legacy - New behavior but backwards compatible, (data: FlyteFile[csv] is written to outputs/data and outputs/data.csv)

Cons:

  • The download behavior is granular at the file level, but the CLI flag is granular at the task level (more or less). This makes the configuration less flexible, and begins to somewhat overload the configurability of copilot at the CLI level.

2. No changes, user code should be modified to read from the existing path e.g. data and add the extension itself

Cons:

  • Migrating large projects to Flyte becomes burdensome, when needing to add another extra step like this here - user code needs to be modified to handle adjusting the file extensions on inputs.

How was this patch tested?

# Container task python code
def t1(datasetA: FlyteFile[TypeVar('csv')], datasetB=Annotated[FlyteFile[TypeVar('csv')], FileDownloadConfig(file_extension="csv", enable_legacy_filename=True)]): …

# flyte-copilot logs
{"json":{},"level":"debug","msg":"inputInterface: [variables:{key:\"datasetA\"  value:{type:{blob:{format:\"csv\"}}  description:\"datasetA\"}}  variables:{key:\"datasetB\"  value:{type:{blob:{format:\"csv\"  file_extension:\"csv\"  enable_legacy_filename:true}}  description:\"datasetB\"}}]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"debug","msg":"varFilenames: map[datasetA:[datasetA[] datasetB:[datasetB.csv datasetB]]","ts":"2026-03-10T23:20:55Z"}
...
{"json":{},"level":"info","msg":"Successfully copied [4119] bytes remote data from [s3://rliucp108882-flyte-data//a0c46ac7-eea3-427d-91c1-a3c6b4a27e83/datasetB] to local [/execution-vol/flows/workflow/inputs/datasetB]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"info","msg":"Successfully copied [4119] bytes remote data from [s3://rliucp108882-flyte-data//a0c46ac7-eea3-427d-91c1-a3c6b4a27e83/datasetB] to local [/execution-vol/flows/workflow/inputs/datasetB.csv]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"info","msg":"Successfully copied [3904] bytes remote data from [s3://rliucp108882-flyte-data//af859c71-811d-4c26-9845-b3fd8830ab6b/datasetA] to local [/execution-vol/flows/workflow/inputs/datasetA]","ts":"2026-03-10T23:20:55Z"}

Labels

Please add one or more of the following labels to categorize your PR:

  • added: For new features.
  • changed: For changes in existing functionality.
  • deprecated: For soon-to-be-removed features.
  • removed: For features being removed.
  • fixed: For any bug fixed.
  • security: In case of vulnerabilities

This is important to improve the readability of release notes.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Add a new `file_extension` string field to the BlobType protobuf message,
allowing FlyteFile to optionally specify a file extension (e.g. "csv") that
flytecopilot appends when writing blobs to local disk during the download
phase (e.g. "data.csv"). When empty (the default), behavior is unchanged.

Add a new `enable_legacy_filename` bool field to the BlobType protobuf message,
allowing FlyteFile to optionally specify whether to preserve backward
compatibility for tasks that read from the extensionless path.

Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger).

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant