GCP: Add scheduled refresh for storage credentials held by GCSFileIO#15696
GCP: Add scheduled refresh for storage credentials held by GCSFileIO#15696nastra wants to merge 1 commit intoapache:mainfrom
Conversation
9574bba to
09f1ad0
Compare
singhpk234
left a comment
There was a problem hiding this comment.
The changes LGTM !
Have some optional suggestions, when seeing things from fresh set of eyes !
| } | ||
|
|
||
| try (OAuth2RefreshCredentialsHandler handler = | ||
| OAuth2RefreshCredentialsHandler.create(properties)) { |
There was a problem hiding this comment.
creating a OAuth2RefreshCredentialsHandler per refresh call would be expensive, it creates a http client and an AuthManager additionally, i wonder if we can just have this done one (invalidated / created again if props change) and just reuse that across refreshes ?
There was a problem hiding this comment.
I don't feel like this is too expensive. Typically, we're talking about possibly one invocation per hour (most queries will likely have none because the run within the timeout), so holding all of the http client thread pools and other resources is the more wasteful approach.
There was a problem hiding this comment.
Typically, we're talking about possibly one invocation per hour (most queries will likely have none because the run within the timeout)
If we create the handler lazily, we can achieve the same.
holding all of the http client thread pools and other resources is the more wasteful approach
should we trim down the http client pool to be way less than ~100 in the first place ? we technically don't need the whole HTTP Client pool (~100) for OAuth2RefreshCredsHandler ? its gonna be mostly call to rest server by picking a connection from the pool .
My main suggestion was we are creating burst of connections / threads which we are discarding anyways after the credentials call completes, hence i though it was an expensive
| .toList(); | ||
|
|
||
| if (!refreshed.isEmpty() && !isResourceClosed.get()) { | ||
| this.storageCredentials = Lists.newArrayList(refreshed); |
There was a problem hiding this comment.
I think we need to schedule a new refresh at this point. Otherwise we never refresh after the first time.
The
GCSFileIOimplementation never refreshes the credentials that are held directly from the table load. The GCS storage clients use aOAuth2RefreshCredentialsHandlerthat internally refreshes the storage client, but those updates are not reflected back to the FileIO.If an
GCSFileIOinstance is serialized to remote workers, the credentials may be expired triggering a thundering herd of requests to refresh immediately.This change addresses this problem by proactively updating the credentials in the FileIO so that only valid credentials are propagated to remote clients.
This applies the same changes as was already done for S3FileIO in #15678