feat(gcp): Add Hierarchical Namespace (HNS) support for GCS bucket #3996
varunarya002 wants to merge 4 commits into apache:main
Conversation
Force-pushed from 0bde95d to 119b101
dimas-b left a comment:
Thanks for your contribution, @varunarya002 !
Since this PR affects REST API parameters, please open a corresponding discussion on the dev ML, as is the usual practice in Polaris.
(this is not a complete review from my side 😅 ... just noting a couple of points for a start)
```yaml
gcsServiceAccount:
  type: string
  description: a Google cloud storage service account
hierarchicalNamespace:
```
nit: a similar field in Azure is called simply `hierarchical`
@varunarya002: Do you have capacity to push this PR forward? If you do not have time, it's fine. We can build something based on your work and still give you attribution.

Hi @dimas-b. I will start a discussion about this change on the dev ML.
Force-pushed from 918e6d4 to f43e793
```java
JsonNode parsedRules = mapper.convertValue(credentialAccessBoundary, JsonNode.class);
JsonNode refRules = readResource(mapper, "gcp-testGenerateAccessBoundaryHnsEnabled.json");
assertThat(parsedRules)
    .usingRecursiveComparison(
```
nit: does simple `equals` not work in this case?
```java
public StorageLocationPreparer create(@Nonnull PolarisStorageConfigurationInfo storageConfig) {
  if (storageConfig instanceof GcpStorageConfigurationInfo && storageConfiguration != null) {
    return new GcsStorageLocationPreparer(storageConfiguration.gcpCredentialsSupplier(clock));
```
Could we use StorageAccessConfigProvider for storage credentials?

It's not feasible because of different trust levels (admin creds for HNS folder creation vs subscoped client tokens).

Hmmm... sorry, but I'm a bit confused now... Clients (e.g. Spark) should be able to create folders (for new data files with Iceberg partitioning) using the vended credentials, right?
```java
() ->
    Optional.ofNullable(
        tableProperties.get(
            IcebergTableLikeEntity.USER_SPECIFIED_WRITE_METADATA_LOCATION_KEY)))
```
This does not look specific to GCS. Could we handle this part in a more general way?

Extracted an AbstractStorageLocationPreparer base class with generic URI parsing, hierarchy building, and bucket grouping.
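As a rough illustration of what such a base class could do, here is a stdlib-only sketch that parses `gs://` location URIs, expands each path into its implied parent-folder chain, and groups the folders by bucket. The class and method names are hypothetical, not the PR's actual code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: expand storage location URIs into the folder
// hierarchy each one implies, grouped by bucket.
class FolderHierarchy {

  /** gs://bucket/a/b/c maps to bucket with folders [a, a/b, a/b/c]. */
  static Map<String, List<String>> foldersByBucket(List<String> locations) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String location : locations) {
      URI uri = URI.create(location);
      String bucket = uri.getHost();
      List<String> folders = result.computeIfAbsent(bucket, b -> new ArrayList<>());
      String path = uri.getPath().replaceAll("^/|/$", ""); // trim leading/trailing '/'
      if (path.isEmpty()) {
        continue;
      }
      StringBuilder prefix = new StringBuilder();
      for (String segment : path.split("/")) {
        if (prefix.length() > 0) {
          prefix.append('/');
        }
        prefix.append(segment);
        String folder = prefix.toString();
        if (!folders.contains(folder)) {
          folders.add(folder); // parents precede children, so creation order is safe
        }
      }
    }
    return result;
  }
}
```

Because parents are emitted before children, a storage-specific subclass could create the folders for each bucket in list order without extra sorting.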
```java
import java.util.Map;

public interface StorageLocationPreparer {
  void prepareTableLocation(String tableLocation, Map<String, String> tableProperties);
```
I wonder if we could simplify this SPI 🤔 I suppose we could simply request a certain set of folders to be created. Do you envision more sophisticated work to be done (in some future cases) for preparing table locations?

Changed to `void prepareLocations(List)`; no Iceberg concepts leak into the SPI.
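The simplified SPI described in that reply might look roughly like this; the exact signature and the `NO_OP` constant are illustrative assumptions, not the PR's code.

```java
import java.util.List;

// Illustrative sketch of the simplified SPI: callers hand over a plain
// list of locations; nothing Iceberg-specific appears in the signature.
interface StorageLocationPreparer {
  /** Ensure the given storage locations (e.g. gs:// URIs) exist before first use. */
  void prepareLocations(List<String> locations);

  /** Shared no-op for storage types that need no preparation. */
  StorageLocationPreparer NO_OP = locations -> {};
}
```

Being a single-method interface, it stays lambda-friendly, which keeps test doubles and the no-op fallback trivial.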
```java
}

public StorageLocationPreparer create(@Nonnull PolarisStorageConfigurationInfo storageConfig) {
  if (storageConfig instanceof GcpStorageConfigurationInfo && storageConfiguration != null) {
```
Let's leverage CDI via storage type-based @Identifier annotations... similar to ServiceProducers.polarisAuthorizerFactory()

The factory now uses an Instance lookup by StorageType.name(); the GCS preparer is annotated with @Identifier("GCS").
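To show the shape of that dispatch without a CDI container, here is a stdlib-only sketch where a plain map keyed by `StorageType.name()` stands in for the bean registry that `Instance.select(Identifier.Literal.of(key))` queries in the PR. All names are illustrative.

```java
import java.util.List;
import java.util.Map;

// Container-free sketch of storage-type keyed preparer selection.
class PreparerFactory {
  interface Preparer {
    void prepareLocations(List<String> locations);
  }

  private static final Preparer NO_OP = locations -> {};

  private final Map<String, Preparer> registry;

  PreparerFactory(Map<String, Preparer> registry) {
    this.registry = registry;
  }

  /** key is StorageType.name(), e.g. "GCS"; unknown types fall back to the no-op. */
  Preparer create(String storageTypeName) {
    return registry.getOrDefault(storageTypeName, NO_OP);
  }
}
```

The CDI version gets the same "unknown type means no-op" behavior by checking whether the selected `Instance` is unsatisfied before resolving it.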
Force-pushed from 1ccdfce to 561f90d
```java
CredentialAccessBoundary.AccessBoundaryRule.AvailabilityCondition.newBuilder()
    .setExpression(String.join(" || ", folderConditions))
    .build());
builder.setAvailablePermissions(List.of("inRole:roles/storage.folderAdmin"));
```
We add this unconditionally in this code, but it is required only for HNS, right?
I suppose it would be more robust to check resolveHnsStatus() here or have an HNS flag in the Storage Config... I'm kind of leaning toward the flag, even though we can auto-detect this... WDYT?
My thinking is that users are generally aware of the HNS flag in their GCS storage (or at least they should be) and it is not likely to change at runtime.
I agree with @dimas-b's comment - is it a concern that we are introducing this permission for any GCS-backed table? That IMO goes against the principle of granting least-privileged access, so I think it would be helpful to review whether it exposes any unnecessary actions for non-HNS-enabled tables.
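The least-privilege shape the reviewers are asking for could be sketched like this: grant the folder-admin role only when the bucket is HNS-enabled. The base objectAdmin role and the method shape are illustrative assumptions, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: gate the extra role on an HNS flag instead of adding it
// unconditionally for every GCS-backed table.
class AccessBoundaryPermissions {
  static List<String> availablePermissions(boolean hnsEnabled) {
    List<String> permissions = new ArrayList<>();
    permissions.add("inRole:roles/storage.objectAdmin");
    if (hnsEnabled) {
      // Managed-folder operations need this on HNS buckets only.
      permissions.add("inRole:roles/storage.folderAdmin");
    }
    return permissions;
  }
}
```

With a flag in the storage config, `hnsEnabled` comes from user-declared configuration rather than a runtime probe, which matches dimas-b's preference above.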
sungwy left a comment:
Hi @varunarya002 thank you so much for putting together this PR. The new StorageLocationPreparer abstraction is a highly reusable idea, and I think it is directionally correct. I took a first pass at reviewing this PR and left some comments.
Also, could we add a bit more context in the PR description about this model? We are effectively introducing a pre-createTable hook into the table creation flow, so it might be worth calling that out explicitly.
```java
 * resolve folder hierarchies, group by bucket, and delegate to {@link
 * #createFoldersForBucket(String, List)} for storage-specific operations.
 */
public abstract class AbstractStorageLocationPreparer implements StorageLocationPreparer {
```
I feel this name is a bit too abstract (pun intended 🙂). Would it make sense to give it a more descriptive name like HierarchicalFolderLocationPreparer or ObjectStoreFolderPreparer?
I also wonder whether introducing this abstraction is a bit premature. Do we expect these methods to be reusable across other cloud providers, or is this really tailored to the GCS HNS case for now?
If it is mainly the latter, would it make sense to fold this into GcsStorageLocationPreparer for now and only extract a shared abstraction once we have a second provider implementation and can validate the actual common ground?
```java
  return NO_OP;
}
String key = storageConfig.getStorageType().name();
Instance<StorageLocationPreparer> selected = preparers.select(Identifier.Literal.of(key));
```
I like StorageLocationPreparer as an abstraction, but instead of introducing it as a separate standalone selection path, would it make more sense for it to be exposed by PolarisStorageIntegration instead?
PolarisStorageIntegration is already selected from the storage configuration type, so grouping storage-specific behavior there seems cleaner and makes the config-driven selection more obvious and consistent.
I do think it still makes sense to model StorageLocationPreparer as a separate capability. I'm just wondering whether PolarisStorageIntegration should be the single interface that exposes that capability, rather than introducing a parallel factory and dispatch path for storage-specific behavior.
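The alternative wiring suggested here could look roughly like the following: the existing storage-integration abstraction exposes the preparer capability itself, so selection stays config-driven in one place. All names and shapes are illustrative, not the actual Polaris interfaces.

```java
import java.util.List;

// Sketch: the integration hands out its own preparer; most storage types
// inherit the no-op default, and an HNS-aware GCS integration overrides it.
interface PolarisStorageIntegration {
  interface StorageLocationPreparer {
    void prepareLocations(List<String> locations);
  }

  StorageLocationPreparer NO_OP = locations -> {};

  /** Default: no preparation needed for this storage type. */
  default StorageLocationPreparer locationPreparer() {
    return NO_OP;
  }
}
```

A default method keeps the change non-breaking for existing integrations while avoiding a second type-keyed factory.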
Issue:
GCS buckets with HNS enabled require explicit folder creation and storage.folderAdmin permissions for managed folder operations. Without this, Iceberg table creation and Spark ingestion fail on HNS buckets.
Changes:
Tests:
Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed)