Core: Simplify RESTTableScan by removing catalog internals#15595

Open
singhpk234 wants to merge 3 commits into apache:main from singhpk234:feature/clean-up-rest-scan-planning

Conversation

@singhpk234
Contributor

@singhpk234 singhpk234 commented Mar 12, 2026

Address #15561 (comment)

  • Remove redundant fields (table, operations, tableIO) from RESTTableScan
  • Replace catalogProperties + hadoopConf with a FileIO factory function
  • Pre-compute resource paths instead of passing ResourcePaths + TableIdentifier
  • Replace Set&lt;Endpoint&gt; with supportsAsync/supportsCancel/supportsFetchTasks booleans
  • Change headers from Map to Supplier for consistency with RESTTable
  • Fix newRefinedScan to not propagate plan-scoped FileIO
  • Fix cleanupPlanResources to always invalidate tracker
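As a rough illustration of the Map-to-Supplier change for headers (the token holder and header names here are hypothetical, not the actual catalog wiring), a Supplier lets every request observe refreshed auth state:

```java
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch: a Supplier<Map<String, String>> is re-evaluated on each
// request, so refreshed credentials are picked up automatically, whereas a
// plain Map would freeze the headers captured at construction time.
public class HeaderSupplierSketch {
  public static void main(String[] args) {
    String[] token = {"token-1"}; // stand-in for a mutable auth session

    Supplier<Map<String, String>> headers =
        () -> Map.of("Authorization", "Bearer " + token[0]);

    System.out.println(headers.get().get("Authorization")); // Bearer token-1
    token[0] = "token-2"; // simulate a token refresh
    System.out.println(headers.get().get("Authorization")); // Bearer token-2
  }
}
```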

private final String fetchScanTasksPath;
private final boolean supportsAsync;
private final boolean supportsCancel;
private final boolean supportsFetchTasks;
Contributor

Why is this a boolean? I think it should be required and planning should fail if task fetching is not supported. @nastra, what do you think? Is it worth supporting cases where plan is implemented but not fetch? That would mean that planning has to be all-at-once, which is a bad design!

Contributor Author

@singhpk234 singhpk234 Mar 12, 2026

Is it worth supporting cases where plan is implemented but not fetch

It could be possible that I planned once already and the plan is cached, so I can send it all at once; that could be a use case. And I don't need to support fetchTask if my planning is fast, or if planning is only supported for small tables. wdyt?

Contributor


@rdblue I think it is reasonable to only support the plan endpoint without immediate support for the tasks endpoint (independent of whether such a design results in good perf). It is likely that servers implementing remote scan planning will start with the easy endpoint (plan) first and add the async planning part later, and I think we should at least not mandate that all endpoints be supported immediately.

Contributor


@singhpk234, wouldn't you cache the results of each request? The motivation for breaking scan planning into multiple requests isn't just the cost of reading manifests. It is also the cost of sending results (you could have thousands of matching files per plan task) and caching them (if you have thousands per request). I doubt that anyone would want to do the work of aggregating plan task results in a cache only to have a more coarse cache that invalidates an entire planning result rather than just parts of it.

My concern here is also motivated by being able to keep things simple. We don't want new REST catalog features that require gracefully falling back for every single endpoint. That's needless complexity. In this case, as we add planning we need to document which parts are required and which ones are optional. I think that supporting both plan and fetch is a minimum for getting this working and we can simplify by asserting that.

@nastra, this doesn't mean the fetch endpoint actually has to do anything. You could always return all results from plan and avoid hitting it. But we don't need to increase the complexity of all clients.

Contributor Author

@singhpk234 singhpk234 Mar 14, 2026


One thing I was thinking is that one could prewarm the caches for small tables. Orthogonally, there may be cases where someone is producing a governed set, in the sense that they hand out a plan-id and take a while to materialize results while the client polls; once they are done producing, they return the whole materialized list in one go (I totally agree, we are still constrained by the size of the payload here). That would be OK with Spark-side semantics, where we wait to collect all the splits on the driver before handing off to the executors, rather than streaming splits like Trino does.
I can see this falling under the "bad architecture" examples, but I believe these are possible as of today.

I think another way to put it is that the server orchestrates the client to come back to it by handing out planTasks; the server is at full liberty not to do so and to achieve whatever architecture it wants, and the two can co-exist. From the spec / reference-impl perspective we can take the stand that this is what we recommend. I feel that's a bit strict, but I can understand how it simplifies things.

Contributor


@nastra, this doesn't mean the fetch endpoint actually has to do anything. You could always return all results from plan and avoid hitting it. But we don't need to increase the complexity of all clients.

@rdblue I believe this should be acceptable for implementors as long as we clearly document the requirements, so I'm in favor of doing this

@singhpk234 singhpk234 marked this pull request as ready for review March 13, 2026 01:31
@singhpk234 singhpk234 requested a review from nastra March 13, 2026 06:11
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/clean-up-rest-scan-planning branch from 21e8da9 to cffc400 Compare March 13, 2026 06:24
TableOperations ops, TableIdentifier finalIdentifier, RESTClient restClient) {
// server supports remote planning endpoint and server / client wants to do server side planning
if (endpoints.contains(Endpoint.V1_SUBMIT_TABLE_SCAN_PLAN) && restScanPlanningEnabled) {
boolean supportsAsync = endpoints.contains(Endpoint.V1_FETCH_TABLE_SCAN_PLAN);
Contributor


should we maybe name this supportsFetchPlanningResult to be more precise? async could mean different things to different people and the endpoint that we're referring to here is the one that returns the planning result for a given planId

Contributor


I would say being able to handle the "submitted" but not "completed" status is a reasonable definition of "async". And I wouldn't say supportsFetchPlanningResult is actually more clear. In the planning APIs, we use "fetch" to refer to the endpoint where you fetch the result of a plan task. I personally find it hard when I see "fetch planning result" vs "fetch plan task result". These are too similar to be helpful.

Contributor


Another concern here is that we don't want people to think that a reasonable implementation may use the async endpoint to plan an entire scan and send it back as one result, rather than breaking up the planning operation and using multiple calls to fetch the results. That would be a really bad use of this API and is another good reason to make fetch required and async optional.

TableIdentifier tableIdentifier,
boolean supportsAsync,
boolean supportsCancel,
boolean supportsFetchTasks,
Contributor


I appreciate that you're turning around these updates so quickly! I was thinking about this a little differently, so I'm sorry that I should have been a bit more clear with my suggestion.

What I was thinking is a class that behaves like ResourcePaths, but at the table level to encapsulate logic that doesn't need to be spread out. Something like a TableResource, which would be constructed from ResourcePaths, TableIdentifier, and Set<Endpoint> for table-specific endpoint checks like supportsAsync.

  • tableResource.planPath() would return the planning submit path
  • tableResource.planPath(planId) would return the async endpoint to poll for the given plan ID
  • tableResource.fetchPath() would return the endpoint for fetching plan task results

I think of this as a "table resource" because it could handle more than just scan planning. For instance, the code for endpoint checks is also littered throughout RESTTableOperations. I would want RESTTableOperations to also delegate as much as possible to the table resource object as well, rather than keeping a copy of the supported endpoints. This wouldn't need to happen right now, but it's how I think about what I'm proposing.

Does that make sense? Do you think it would be a good direction?
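A minimal sketch of what such a TableResource could look like (the path formats and method bodies are assumptions for illustration; only the endpoint names come from the diff):

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch of the proposed table-level resource object. It owns the
// scan-planning paths and endpoint checks so that RESTTableScan (and later
// RESTTableOperations) can delegate instead of keeping copies of this state.
public class TableResourceSketch {
  enum Endpoint {
    V1_SUBMIT_TABLE_SCAN_PLAN,
    V1_FETCH_TABLE_SCAN_PLAN,
    V1_FETCH_SCAN_TASKS
  }

  static class TableResource {
    private final String tablePath; // e.g. "v1/namespaces/db/tables/t" (assumed format)
    private final Set<Endpoint> endpoints;

    TableResource(String tablePath, Set<Endpoint> endpoints) {
      this.tablePath = tablePath;
      this.endpoints = endpoints;
    }

    // planning submit path
    String planPath() {
      return tablePath + "/plan";
    }

    // async endpoint to poll for the given plan ID
    String planPath(String planId) {
      return planPath() + "/" + planId;
    }

    // endpoint for fetching plan task results
    String fetchPath() {
      return tablePath + "/tasks";
    }

    // table-specific endpoint check, resolved from the endpoint set on demand
    boolean supportsAsync() {
      return endpoints.contains(Endpoint.V1_FETCH_TABLE_SCAN_PLAN);
    }
  }

  public static void main(String[] args) {
    TableResource resource =
        new TableResource(
            "v1/namespaces/db/tables/t",
            EnumSet.of(Endpoint.V1_SUBMIT_TABLE_SCAN_PLAN, Endpoint.V1_FETCH_TABLE_SCAN_PLAN));
    System.out.println(resource.planPath("plan-123")); // v1/namespaces/db/tables/t/plan/plan-123
    System.out.println(resource.supportsAsync()); // true
  }
}
```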

Contributor Author


I was just trying to get this in ASAP (keeping 1.11 in mind) :) !

This makes sense to me overall! Though some questions inline:

for table-specific endpoint checks like supportsAsync.

Do you envision these returning null when things are not applicable? We could add that to the function contract.

Something like a TableResource, which would be constructed from

Do you think we would need something like a NamespaceResource? I am mostly wondering whether there is an opportunity for a BaseResource class where I can put the CRUD-related calls.

+1 on moving this stuff out of RESTTableOps, will think more

.map(c -> StorageCredential.create(c.prefix(), c.config()))
.collect(Collectors.toList()));
private FileIO scanFileIO(List<Credential> credentials) {
FileIO ioForScan = scanContext.createFileIO(credentials, planId);
Contributor


This looks reasonable.
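As a rough sketch of the factory-function idea (the types below are stand-ins, not the real Credential/FileIO classes): the catalog hands the scan a callback, and the scan builds a plan-scoped FileIO from the plan response without ever seeing catalog properties or the Hadoop conf:

```java
import java.util.List;
import java.util.function.BiFunction;

// Illustrative sketch: the scan holds only a FileIO factory function, so the
// catalog internals (properties, Hadoop conf) stay out of RESTTableScan.
public class FileIoFactorySketch {
  // Stand-ins for the real Credential and FileIO types.
  record Credential(String prefix) {}

  interface FileIO {
    String describe();
  }

  public static void main(String[] args) {
    // Supplied by the catalog (here as a plain lambda); invoked by the scan
    // with the credentials from the plan response plus the plan ID.
    BiFunction<List<Credential>, String, FileIO> fileIOFactory =
        (credentials, planId) ->
            () -> "FileIO for plan " + planId + " with " + credentials.size() + " credential(s)";

    FileIO scanIO = fileIOFactory.apply(List.of(new Credential("s3://bucket/")), "plan-42");
    System.out.println(scanIO.describe());
  }
}
```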

 assertThatThrownBy(restTableScanFor(table)::planFiles)
-    .isInstanceOf(UnsupportedOperationException.class)
-    .hasMessage("Server does not support endpoint: %s", Endpoint.V1_FETCH_TABLE_SCAN_PLAN);
+    .isInstanceOf(IllegalStateException.class)
Contributor


I think this is correct. If a service puts the client in a submitted state but can't complete the operation after that, it's an illegal state.

- Remove redundant fields (table, operations, tableIO) from RESTTableScan
- Replace catalogProperties + hadoopConf with a FileIO factory function
- Pre-compute resource paths instead of passing ResourcePaths + TableIdentifier
- Replace Set<Endpoint> with supportsAsync/supportsCancel/supportsFetchTasks booleans
- Change headers from Map to Supplier for consistency with RESTTable
- Fix newRefinedScan to not propagate plan-scoped FileIO
- Fix cleanupPlanResources to always invalidate tracker
- Use PlanStatus.SUBMITTED in error message for consistency
- Make ParserContext lazily initialized to avoid recreation on newRefinedScan()
- Change fileIOFactory signature to accept Credential and planId directly,
  moving StorageCredential conversion into the RESTTable lambda
- Introduce RESTScanContext to collapse path/boolean/factory constructor params
…back

- Rename RESTScanContext to TableResource, accepting Set<Endpoint>
  instead of pre-resolved booleans for future reuse
- Rename methods: planTableScanPath() to planPath(), fetchScanTasksPath() to fetchPath()
- Rename parserContext field to lazyParserContext to clarify lazy init
- Remove supportsFetchTasks check (fetch endpoint is now required)
- Simplify RESTTable constructor from 12 to 10 params
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/clean-up-rest-scan-planning branch from cffc400 to ff6de18 Compare March 21, 2026 06:13