feat(gcp): Add Hierarchical Namespace (HNS) support for GCS bucket #3996

Open
varunarya002 wants to merge 4 commits into apache:main from varunarya002:gcp_hns_bucket

Conversation

@varunarya002 commented Mar 13, 2026

Issue:
GCS buckets with HNS enabled require explicit folder creation and storage.folderAdmin permissions for managed folder operations. Without this, Iceberg table creation and Spark ingestion fail on HNS buckets.

Changes:

  • Add hierarchicalNamespace boolean field to GcpStorageConfigInfo (OpenAPI spec)
  • Add @nullable Boolean isHierarchicalNamespace() to GcpStorageConfigurationInfo
  • Generate storage.folderAdmin access boundary rule scoped to managedFolders/{path} for write locations when HNS is enabled
  • Plumb HNS flag through CatalogEntity model<->entity conversion
  • Add backward-compatible 3-param overload defaulting HNS to false
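The folder-scoped rule generation described in the third bullet can be sketched roughly as follows. This is a hypothetical helper with illustrative names, not the PR's actual code; the real implementation attaches the resulting expression to a Google CredentialAccessBoundary rule.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the folder-scoped availability condition described
// above: one clause per write location, OR-ed together, limiting
// storage.folderAdmin to specific managedFolders/{path} resources.
// Bucket and folder names here are illustrative only.
public class HnsFolderConditionSketch {

  static String buildExpression(String bucket, List<String> folderPaths) {
    return folderPaths.stream()
        .map(p -> "resource.name.startsWith('projects/_/buckets/" + bucket
            + "/managedFolders/" + p + "')")
        .collect(Collectors.joining(" || "));
  }

  public static void main(String[] args) {
    System.out.println(buildExpression("my-bucket", List.of("ns/table1", "ns/table2")));
  }
}
```

In the PR itself, an expression of this shape is paired with available permissions containing inRole:roles/storage.folderAdmin, as a snippet later in the thread shows.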

Tests:

  • HNS enabled single bucket (read + write + folder rules)
  • HNS with multiple buckets and partial writes
  • HNS without writes (no folder rules)
  • HNS with separate metadata and data buckets
  • HNS same bucket with separate metadata/data paths
  • Non-HNS vs HNS rule count comparison
  • Config serialization round-trip
  • CatalogEntity round-trip with HNS flag

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board, Mar 13, 2026
@varunarya002 changed the title from "feat(gcp): Add Hierarchical Namespace (HNS) support for GCS bucket cr…" to "feat(gcp): Add Hierarchical Namespace (HNS) support for GCS bucket", Mar 13, 2026
@dimas-b (Contributor) left a comment:


Thanks for your contribution, @varunarya002 !

Since this PR affects REST API parameters, please open a corresponding discussion on the dev ML, which is the usual practice in Polaris.

(this is not a complete review from my side 😅 ... just noting a couple of points for a start)

Comment thread on spec/polaris-management-service.yml (Outdated)
gcsServiceAccount:
type: string
description: a Google cloud storage service account
hierarchicalNamespace:
Contributor:
nit: a similar field in Azure is called simply hierarchical


dimas-b commented Apr 1, 2026

@varunarya002 : Do you have capacity to push this PR forward? If you do not have time, that's fine. We can build something based on your work and still give you attribution.

@varunarya002 (Author):

Hi @dimas-b. I will start a discussion about this change on the dev ML.

@varunarya002 force-pushed the gcp_hns_bucket branch 5 times, most recently from 918e6d4 to f43e793, April 5, 2026 08:32
JsonNode parsedRules = mapper.convertValue(credentialAccessBoundary, JsonNode.class);
JsonNode refRules = readResource(mapper, "gcp-testGenerateAccessBoundaryHnsEnabled.json");
assertThat(parsedRules)
.usingRecursiveComparison(
Contributor:

nit: does simple equals not work in this case?


public StorageLocationPreparer create(@Nonnull PolarisStorageConfigurationInfo storageConfig) {
if (storageConfig instanceof GcpStorageConfigurationInfo && storageConfiguration != null) {
return new GcsStorageLocationPreparer(storageConfiguration.gcpCredentialsSupplier(clock));
Contributor:

Could we use StorageAccessConfigProvider for storage credentials?

Author:

It's not feasible because of the different trust levels involved (admin credentials for HNS folder creation vs. subscoped client tokens).

Contributor:

Hmmm... sorry, but I'm a bit confused now... Clients (e.g. Spark) should be able to create folders (for new data files with Iceberg partitioning) using the vended credentials, right?

() ->
Optional.ofNullable(
tableProperties.get(
IcebergTableLikeEntity.USER_SPECIFIED_WRITE_METADATA_LOCATION_KEY)))
Contributor:

This does not look specific to GCS. Could we handle this part in a more general way?

Author:

Extracted an AbstractStorageLocationPreparer base class with generic URI parsing, hierarchy building, and bucket grouping.
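The generic steps mentioned here (parse storage URIs, expand each path into its parent-folder chain, group folders by bucket) can be sketched as a small self-contained example. Class and method names below are assumptions for illustration, not the PR's actual code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: "gs://b1/ns/table" expands to the folder chain
// ["ns", "ns/table"], keyed by bucket. Names are hypothetical.
public class FolderHierarchySketch {

  static Map<String, List<String>> groupFoldersByBucket(List<String> locations) {
    Map<String, List<String>> byBucket = new LinkedHashMap<>();
    for (String location : locations) {
      URI uri = URI.create(location);
      String bucket = uri.getAuthority();
      // strip leading/trailing slashes from the path
      String path = uri.getPath().replaceAll("^/|/$", "");
      if (path.isEmpty()) {
        continue;
      }
      List<String> folders = byBucket.computeIfAbsent(bucket, b -> new ArrayList<>());
      StringBuilder prefix = new StringBuilder();
      for (String segment : path.split("/")) {
        if (prefix.length() > 0) {
          prefix.append('/');
        }
        prefix.append(segment);
        String folder = prefix.toString();
        if (!folders.contains(folder)) {
          folders.add(folder); // parents before children, no duplicates
        }
      }
    }
    return byBucket;
  }

  public static void main(String[] args) {
    System.out.println(
        groupFoldersByBucket(List.of("gs://b1/ns/table/data", "gs://b1/ns/table/metadata")));
  }
}
```

The ordering (parents before children) matters because HNS folder creation requires the parent folder to exist first.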

import java.util.Map;

public interface StorageLocationPreparer {
void prepareTableLocation(String tableLocation, Map<String, String> tableProperties);
Contributor:

I wonder if we could simplify this SPI 🤔 I suppose we could simply request a certain set of folders to be created. Do you envision more sophisticated work to be done (in some future cases) for preparing table locations?

Author:

Changed to void prepareLocations(List) — no Iceberg concepts leak into the SPI.
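A rough illustration of the simplified SPI shape described in this reply. The interface name and method come from the discussion; the no-op constant and the demo wiring are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the simplified SPI: callers pass plain location URIs, so no
// Iceberg concepts appear in the interface. Everything beyond the method
// signature is illustrative.
public class PreparerSpiSketch {

  interface StorageLocationPreparer {
    void prepareLocations(List<String> locations);
  }

  // No-op preparer for storage types that need no preparation.
  static final StorageLocationPreparer NO_OP = locations -> {};

  public static void main(String[] args) {
    List<String> created = new ArrayList<>();
    // Stand-in implementation that just records requested locations.
    StorageLocationPreparer recording = created::addAll;
    recording.prepareLocations(List.of("gs://bucket/ns/table"));
    System.out.println(created);
  }
}
```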

}

public StorageLocationPreparer create(@Nonnull PolarisStorageConfigurationInfo storageConfig) {
if (storageConfig instanceof GcpStorageConfigurationInfo && storageConfiguration != null) {
@dimas-b (Contributor) commented Apr 8, 2026:

Let's leverage CDI via storage type-based @Identifier annotations... similar to ServiceProducers.polarisAuthorizerFactory()

Author:

Factory uses Instance lookup by StorageType.name(). GCS preparer annotated with @Identifier("GCS").


dimas-b commented Apr 16, 2026

CredentialAccessBoundary.AccessBoundaryRule.AvailabilityCondition.newBuilder()
.setExpression(String.join(" || ", folderConditions))
.build());
builder.setAvailablePermissions(List.of("inRole:roles/storage.folderAdmin"));
We add this unconditionally in this code, but it is required only for HNS, right?

I suppose it would be more robust to check resolveHnsStatus() here, or to have an HNS flag in the storage config. I'm leaning toward the flag, even though we could auto-detect this. WDYT?

My thinking is that users are generally aware of the HNS setting on their GCS storage (or at least they should be), and it is not likely to change at runtime.

Contributor:

I agree with @dimas-b's comment. Is it a concern that we are introducing this permission for any GCS-backed tables? That IMO goes against the principle of least-privilege access, so I think it would be helpful to review whether this exposes any unnecessary actions for non-HNS-enabled tables.

@sungwy (Contributor) left a comment:

Hi @varunarya002 thank you so much for putting together this PR. The new StorageLocationPreparer abstraction is a highly reusable idea, and I think it is directionally correct. I took a first pass at reviewing this PR and left some comments.

Also, could we add a bit more context in the PR description about this model? We are effectively introducing a pre-createTable hook into the table creation flow, so it might be worth calling that out explicitly.

* resolve folder hierarchies, group by bucket, and delegate to {@link
* #createFoldersForBucket(String, List)} for storage-specific operations.
*/
public abstract class AbstractStorageLocationPreparer implements StorageLocationPreparer {
@sungwy (Contributor) commented Apr 20, 2026:

I feel this name is a bit too abstract (pun intended 🙂). Would it make sense to give it a more descriptive name like HierarchicalFolderLocationPreparer or ObjectStoreFolderPreparer?

I also wonder whether introducing this abstraction is a bit premature. Do we expect these methods to be reusable across other cloud providers, or is this really tailored to the GCS HNS case for now?

If it is mainly the latter, would it make sense to fold this into GcsStorageLocationPreparer for now and only extract a shared abstraction once we have a second provider implementation and can validate the actual common ground?

return NO_OP;
}
String key = storageConfig.getStorageType().name();
Instance<StorageLocationPreparer> selected = preparers.select(Identifier.Literal.of(key));
Contributor:

I like the StorageLocationPreparer as an abstraction, but instead of introducing it as a separate standalone selection path, would it make more sense for it to be exposed by PolarisStorageIntegration instead?

PolarisStorageIntegration is already selected from the storage configuration type, so grouping storage-specific behavior there seems cleaner and makes the config-driven selection more obvious and consistent.

I do think it still makes sense to model StorageLocationPreparer as a separate capability. I'm just wondering whether PolarisStorageIntegration should be the single interface that exposes that capability, rather than introducing a parallel factory and dispatch path for storage-specific behavior.

