Core: Simplify RESTTableScan by removing catalog internals by singhpk234 · Pull Request #15595 · apache/iceberg

singhpk234 · 2026-03-12T07:28:53Z

Remove redundant fields (table, operations, tableIO) from RESTTableScan
Replace catalogProperties + hadoopConf with a FileIO factory function
Pre-compute resource paths instead of passing ResourcePaths + TableIdentifier
Replace Set with supportsAsync/supportsCancel/supportsFetchTasks booleans
Change headers from Map to Supplier for consistency with RESTTable
Fix newRefinedScan to not propagate plan-scoped FileIO
Fix cleanupPlanResources to always invalidate tracker

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java

rdblue · 2026-03-12T23:22:05Z

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java

+  private final String fetchScanTasksPath;
+  private final boolean supportsAsync;
+  private final boolean supportsCancel;
+  private final boolean supportsFetchTasks;


Why is this a boolean? I think it should be required and planning should fail if task fetching is not supported. @nastra, what do you think? Is it worth supporting cases where plan is implemented but not fetch? That would mean that planning has to be all-at-once, which is a bad design!

Is it worth supporting cases where plan is implemented but not fetch

It could be possible i planned once already and the plan is cached and i can send all-at-once it could be a use case, and i don;t need to do support fetchTask if my planning is fast or its only supported for small tables wdyt ?

@rdblue I think it is reasonable to only support the plan endpoint without having immediate support for the tasks endpoint (independently from whether such a design results in good perf). It is likely that servers implementing remote scan planning will start with the easy endpoint first (plan) and later add the async planning part and I think we should at least not mandate that all endpoints must be supported immediately

@singhpk234, wouldn't you cache the results of each request? The motivation for breaking scan planning into multiple requests isn't just the cost of reading manifests. It is also the cost of sending results (you could have thousands of matching files per plan task) and caching them (if you have thousands per request). I doubt that anyone would want to do the work of aggregating plan task results in a cache only to have a more coarse cache that invalidates an entire planning result rather that just parts of it.

My concern here is also motivated by being able to keep things simple. We don't want new REST catalog features that require gracefully falling back for every single endpoint. That's needless complexity. In this case, as we add planning we need to document which parts are required and which ones are optional. I think that supporting both plan and fetch is a minimum for getting this working and we can simplify by asserting that.

@nastra, this doesn't mean the fetch endpoint actually has to do anything. You could always return all results from plan and avoid hitting it. But we don't need to increase the complexity of all clients.

One can prewarm the caches is one thing i was thinking for small tables ? orthogonally there may be cases where some one is producing a governed set in a sense they gave plan-id and took a while to materialize, until the client polls, one they are done producing they just give the whole list of materialized, then they give it one go (I totally agree, we are still constrained by size of the payload here), which would be ok to it in spark side semantics where we wait to collect all the splits on the driver before handling of to the executor, rather than having a streaming splits like Trino ?
I can say this falls under example of "bad architecture" but i believe these are possible as of today.

I think one other way to put is its the server who is orchestrating the client to come to them by giving them the planTask, they are in their full liberty to not do so and achieve the architecture they want and co-exit and from the spec / ref impl perspective we can take a stand that this is what we recommend, though i feel its a bit strict but i can understand how this simplifies things

@nastra, this doesn't mean the fetch endpoint actually has to do anything. You could always return all results from plan and avoid hitting it. But we don't need to increase the complexity of all clients.

@rdblue I believe this should be acceptable for implementors as long as we clearly document the requirements, so I'm in favor of doing this

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java

nastra · 2026-03-13T11:46:59Z

core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java

      TableOperations ops, TableIdentifier finalIdentifier, RESTClient restClient) {
    // server supports remote planning endpoint and server / client wants to do server side planning
    if (endpoints.contains(Endpoint.V1_SUBMIT_TABLE_SCAN_PLAN) && restScanPlanningEnabled) {
+      boolean supportsAsync = endpoints.contains(Endpoint.V1_FETCH_TABLE_SCAN_PLAN);


should we maybe name this supportsFetchPlanningResult to be more precise? async could mean different things to different people and the endpoint that we're referring to here is the one that returns the planning result for a given planId

I would say being able to handle the "submitted" but not "completed" status is a reasonable definition of "async". And I wouldn't say supportsFetchPlanningResult is actually more clear. In the planning APIs, we use "fetch" to refer to the endpoint where you fetch the result of a plan task. I personally find it hard when I see "fetch planning result" vs "fetch plan task result". These are too similar to be helpful.

Another concern here is that we don't want people to think that a reasonable implementation may use the async endpoint to plan an entire scan and send it back as one result, rather than breaking up the planning operation and using multiple calls to fetch the results. That would be a really bad use of this API and is another good reason to make fetch required and async optional.

core/src/main/java/org/apache/iceberg/rest/RESTScanContext.java

rdblue · 2026-03-13T20:36:01Z

core/src/main/java/org/apache/iceberg/rest/RESTScanContext.java

+      TableIdentifier tableIdentifier,
+      boolean supportsAsync,
+      boolean supportsCancel,
+      boolean supportsFetchTasks,


I appreciate that you're turning around these updates so quickly! I was thinking about this a little differently, so I'm sorry that I should have been a bit more clear with my suggestion.

What I was thinking is a class that behaves like ResourcePaths, but at the table level to encapsulate logic that doesn't need to be spread out. Something like a TableResource, which would be constructed from ResourcePaths, TableIdentifier, and Set<Endpoint> for table-specific endpoint checks like supportsAsync.

tableResource.planPath() would return the planning submit path

tableResource.planPath(planId) would return the async endpoint to poll for the give plan ID

tableResource.fetchPath() would return the endpoint for fetching plan task results

I think of this as a "table resource" because it could handle more than just scan planning. For instance, the code for endpoint checks is also littered throughout RESTTableOperations. I would want RESTTableOperations to also delegate as much as possible to the table resource object as well, rather than keeping a copy of the supported endpoints. This wouldn't need to happen right now, but it's how I think about what I'm proposing.

Does that make sense? Do you think it would be a good direction?

I was just trying to get this in ASAP (keeping 1.11 in mind) :) !

This makes sense to me, over all ! Though some questions inline:

for table-specific endpoint checks like supportsAsync.

Do you envision these to be kind of return null when things are not applicable, we can add in the function contract.

Something like a TableResource, which would be constructed from

Do you think, we would need something like Namespace resource, i am mostly thinking if there is an oppurtunity for having a BaseResource class where i can put the CRUD related call in ti.

+1 on moving this stuff out of RESTTableOps, will think more

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java

rdblue · 2026-03-13T22:15:39Z

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java

-                .map(c -> StorageCredential.create(c.prefix(), c.config()))
-                .collect(Collectors.toList()));
+  private FileIO scanFileIO(List<Credential> credentials) {
+    FileIO ioForScan = scanContext.createFileIO(credentials, planId);


This looks reasonable.

rdblue · 2026-03-13T22:19:43Z

core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java

    assertThatThrownBy(restTableScanFor(table)::planFiles)
-        .isInstanceOf(UnsupportedOperationException.class)
-        .hasMessage("Server does not support endpoint: %s", Endpoint.V1_FETCH_TABLE_SCAN_PLAN);
+        .isInstanceOf(IllegalStateException.class)


I think this is correct. If a service puts the client in a submitted state but can't complete the operation after that, it's an illegal state.

core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java

- Remove redundant fields (table, operations, tableIO) from RESTTableScan - Replace catalogProperties + hadoopConf with a FileIO factory function - Pre-compute resource paths instead of passing ResourcePaths + TableIdentifier - Replace Set<Endpoint> with supportsAsync/supportsCancel/supportsFetchTasks booleans - Change headers from Map to Supplier for consistency with RESTTable - Fix newRefinedScan to not propagate plan-scoped FileIO - Fix cleanupPlanResources to always invalidate tracker

- Use PlanStatus.SUBMITTED in error message for consistency - Make ParserContext lazily initialized to avoid recreation on newRefinedScan() - Change fileIOFactory signature to accept Credential and planId directly, moving StorageCredential conversion into the RESTTable lambda - Introduce RESTScanContext to collapse path/boolean/factory constructor params

…back - Rename RESTScanContext to TableResource, accepting Set<Endpoint> instead of pre-resolved booleans for future reuse - Rename methods: planTableScanPath() to planPath(), fetchScanTasksPath() to fetchPath() - Rename parserContext field to lazyParserContext to clarify lazy init - Remove supportsFetchTasks check (fetch endpoint is now required) - Simplify RESTTable constructor from 12 to 10 params

github-actions bot added the core label Mar 12, 2026

nastra mentioned this pull request Mar 12, 2026

API, Core: Add FileIO to Scan API #15561

Merged

rdblue reviewed Mar 12, 2026

View reviewed changes

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java Outdated Show resolved Hide resolved

rdblue reviewed Mar 12, 2026

View reviewed changes

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java Outdated Show resolved Hide resolved

rdblue reviewed Mar 12, 2026

View reviewed changes

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java Outdated Show resolved Hide resolved

rdblue reviewed Mar 12, 2026

View reviewed changes

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java Outdated Show resolved Hide resolved

singhpk234 marked this pull request as ready for review March 13, 2026 01:31

singhpk234 requested a review from nastra March 13, 2026 06:11

sfc-gh-prsingh force-pushed the feature/clean-up-rest-scan-planning branch from 21e8da9 to cffc400 Compare March 13, 2026 06:24

nastra reviewed Mar 13, 2026

View reviewed changes

rdblue reviewed Mar 13, 2026

View reviewed changes

core/src/main/java/org/apache/iceberg/rest/RESTScanContext.java Outdated Show resolved Hide resolved

rdblue reviewed Mar 13, 2026

View reviewed changes

core/src/main/java/org/apache/iceberg/rest/RESTScanContext.java Outdated Show resolved Hide resolved

rdblue reviewed Mar 13, 2026

View reviewed changes

core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java Outdated Show resolved Hide resolved

rdblue reviewed Mar 13, 2026

View reviewed changes

core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java Outdated Show resolved Hide resolved

sfc-gh-prsingh added 3 commits March 20, 2026 22:26

sfc-gh-prsingh force-pushed the feature/clean-up-rest-scan-planning branch from cffc400 to ff6de18 Compare March 21, 2026 06:13

Conversation

singhpk234 commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

singhpk234 Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

singhpk234 Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

singhpk234 commented Mar 12, 2026 •

edited

Loading

singhpk234 Mar 12, 2026 •

edited

Loading

singhpk234 Mar 14, 2026 •

edited

Loading