Skip to content

A107 - TLS Private Key Offloading#524

Open
gtcooke94 wants to merge 5 commits intogrpc:masterfrom
gtcooke94:tls_offloading
Open

A107 - TLS Private Key Offloading#524
gtcooke94 wants to merge 5 commits intogrpc:masterfrom
gtcooke94:tls_offloading

Conversation

@gtcooke94
Copy link
Contributor

@gtcooke94 gtcooke94 commented Dec 1, 2025

gRFC for TLS Private Key Offloading

// Note that when a signature of a hash of a larger message is needed,
// the caller is responsible for hashing the larger message and passing
// the hash (as digest) and the hash function (as opts) to Sign.
Sign(rand io.Reader, digest []byte, opts SignerOpts) (signature []byte, err error)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an obscure usecase for supporting MessageSigner here for restricted TPM keys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up - I took a look at the change and I think someone implementing crypto.MessageSigner, and I think it should "just work" since crypto/tls supports a crypto.MessageSigner as a PrivateKey in the tls.Certificate.

So I think gRPC will pretty blindly pass this through while configuring tls.Config. If the user implements a crypto.MessageSigner with a SignMessage it'll use that

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, if anything implements messagesigner, it'd get automatically pickedup and used.

as mentioned internally, that the advancedtls pakage which AFAIK, is only in go, you can offload TLS now

@dfawley dfawley requested a review from easwars December 19, 2025 18:39
The most complex piece of this is the implementation \- C-Core/Cython/Python must handle the calling of the user provided Python signing function from C which must invoke a callback that is passed to it. This will involve creating bridge types between the Python user sign function and the expected `absl::AnyInvocable` as well as bridging the callback that is passed to the user sign function while managing the GIL and asynchronous nature of the signing. This is technically feasible with Cython. A proof-of-concept of this structure [is written here.](https://source.corp.google.com/piper///depot/google3/experimental/users/gregorycooke/python_cpp_wrapping/)

```py
# Example Usage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to restart the conversation we had elsewhere.

Would it make sense to abstract this call away? Make the signer just return the bytes, and the callback will be handled by our custom function? With this approach, we'll avoid the problem where users forgot to call the callback (f.e. by not wrapping their method into try ... finally).

The main point is to abstract away internal implementation details as much as possible.

# Internal wrapper
def _grpc_invoke_custom_private_key_sign(
    private_key_sign_fn: CustomPrivateKeySign,
    unsigned_data: bytes,
    algorithm: SignatureAlgorithm,
    on_done: PrivateKeySignDoneCallback,
) -> None:
    try:
        signed_bytes = private_key_sign_fn(unsigned_data, algorithm)
        if signed_bytes:
            on_done(signed_bytes, True)
        else:
            on_done(None, False)
    except Exception as e:
        logging.warning("Exception during custom private key sign: %s", e)
        on_done(None, False)


# User's implementation example
def example_signer(
    unsigned_data: bytes, algorithm: SignatureAlgorithm
) -> Optional[bytes]:
    # Manually sign the bytes.
    signed_bytes = ...
    return signed_bytes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke with @matthewstevenson88 - he explained some of the async patterns our users might apply when CustomPrivateKeySign. One of them is creating a thread that will do the computation, passing on_done callback to the thread, while immediately returning from the CustomPrivateKeySign. This is a good argument for not introducing an internal wrapper.

I need think about this some more, and consider a few nuances:

  • Whether a thread will actually be useful for offloading CPU-bound tasks - because if GIL is stuck on this thread waiting for the signing to complete, there won't be any parallelization/concurrency (unless it calls into a Python C extension that disables GIL)
  • Whether we still can use the wrapper approach, but with a different interface, like using the Future interface. Future should make it easier for the user to work with various PoolExecutors.
  • Whether we should also allow an async method signature (make private_key_sign_fn a Union[CustomPrivateKeySign, CustomPrivateKeySignAsync]) - which may be required if it's common for the HW offloading libraries the do the signing to use async/await pattern (asyncio python).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the reference,

GIL and performance considerations

Unlike the multiprocessing module, which uses separate processes to bypass the global interpreter lock (GIL), the threading module operates within a single process, meaning that all threads share the same memory space. However, the GIL limits the performance gains of threading when it comes to CPU-bound tasks, as only one thread can execute Python bytecode at a time. Despite this, threads remain a useful tool for achieving concurrency in many scenarios.

https://docs.python.org/3/library/threading.html#gil-and-performance-considerations

Copy link
Contributor Author

@gtcooke94 gtcooke94 Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you had the chance to consider these options? We need to move relatively quickly on this, and that prototype that I made earlier in the year seemed viable? Can we keep this and disable the GIL?

Can we allow for Future or asyncio with the cython integration?

I'll lean on you for these decisions since you are the Python expert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been toying around with how we wrap this more

The C++ layers knows it is wrapping Python. What if we just make the C++ wrapper layer responsible for the async.

i.e.

... implementing the interface ...
class PrivateKeySignerPyWrapper : PrivateKeySigner {

... details, assume self_->sign() is calling a cython method that's holding a python callable and is going to call into Python, here's where we _actually_ call it  ...
  
  thread::Detach({}, [s = std::string(input), self = shared_from_this(),
                       context] {
    PyGILState_STATE gstate;
    gstate = PyGILState_Ensure();
    // This is calling into Python
    self->sign_(s, CompletionCallbackForPy, context, self->user_data_);
    PyGILState_Release(gstate);
  });

}

We could go even further in this case, the python sign function doesn't need to know about a callback, calling the callback, or handling async stuff - it just simply returns signed data sync.

  thread::Detach({}, [s = std::string(input), self = shared_from_this(),
                       context] {
    PyGILState_STATE gstate;
    gstate = PyGILState_Ensure();
    // This is calling into Python
    std::string signed_data = self->sign_(s, context, self->user_data_);
    PyGILState_Release(gstate);
    // C++ wrapping layer is responsible for the async and calling the callback
    CompletionCallback(signed_data);
  });

// To use, set the provider on the TlsCredentialsOptions
```

### Python
Copy link
Member

@sergiitk sergiitk Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mention that gevent is not supported, both in the gRFC and the ssl_channel_credentials_with_custom_signer docstring.


```

We won't significantly refactor the Python API surface \- instead we will allow the `private_key` input to be a signing function.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this in relevant in the last iteration - we provided a new method instead of modifying the existing one to support private_key


# Now the user is in their application configuring gRPC
# Create creds with the custom signer
creds = ssl_channel_credentials_with_custom_signer(<some_root>, example_signer, <some_chain>)
Copy link
Member

@sergiitk sergiitk Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note - we require arguments to be passed as keyword arguments now.

Suggested change
creds = ssl_channel_credentials_with_custom_signer(<some_root>, example_signer, <some_chain>)
creds = ssl_channel_credentials_with_custom_signer(
private_key_sign_fn=example_signer,
root_certificates=b"<some_root>",
certificate_chain=b"<some_chain>",
)

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up!

Please let me know if you have any questions. Thanks!

to configure this feature will rely on a new certificate provider
implementation.

We will create an `InMemoryCertificateProvider` that takes a set of root certs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are really 3 distinct API changes here, and I think it would be helpful to discuss each of them separately:

  1. Adding InMemoryCertificateProvider, and deprecating StaticDataCertificateProvider.
  2. Allowing a separate cert provider for root and identity certs.
  3. Adding the new PrivateKeySigner API.

I think we should first list the changes, then talk about each one in detail in its own subsection. For each one, we should describe why we're making the change and then document the API we're introducing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the subsections describing each of the three changes in detail, please add a bulletted list of the three changes, with just a few words about each to say why we're doing it. That way, the reader has a bit of context about why we're making these changes before we dive into the details.


| Metric | Type | Unit | Labels | Description |
| :---- | :---- | :---- | :---- | :---- |
| `grpc.security.handshaker.offloaded_private_key_sign_duration` | Histogram | float64/double s | `grpc.target, grpc.security.handshaker.offloaded_private_key_sign.algorithm, grpc.security.handshaker.offloaded_private_key_sign.status` | How long the offloaded private key signing took |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we actually implemented this metric in C-core. Do we still need to do that?

Implementing this in C-core raises a layering question. The TLS TSI handshaker is the thing that has this data to export, but the non-per-call metrics APIs are gRPC-specific APIs, which means that the TSI implementation would need to use gRPC-specific APIs. Won't that make it harder for this TSI implementation to be used outside of gRPC? Are we okay with that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have no security metrics yet (that's happening ~now, we are working on the concept review) - this will be included in that work.

The layering question is one of the prime questions for our concept and design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we haven't implemented this as part of this feature, how about we leave the metric out of this gRFC and include it in the forthcoming gRFC for all of the security-related metrics?

@gtcooke94 gtcooke94 requested a review from markdroth March 19, 2026 19:12
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moving in the right direction!

Please let me know if you have any questions. Thanks!

Comment on lines +36 to +37
* [A66]
* [A79]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change these lines to:

* [A66: OpenTelemetry Metrics][A66]
* [A79: Non-per-call Metrics Architecture][A79]

The updates will return an `absl::Status` to the caller, and further a
`ValidateCredentials` API is provided.

```c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say c++ instead of just c.

the user. This provides a strict super-set of functionality of the current
`StaticDataCertificateProvider` - calling `UpdateRoot` and `UpdateIdentity` on
an `InMemoryCertificateProvider` at setup is the same as a
`StaticDataCertificateProvider`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should explicitly state that we are going to deprecate StaticDataCertificateProvider.

void set_identity_certificate_provider(grpc_core::RefCountedPtr<grpc_tls_certificate_provider> certificate_provider);
#### Splitting Certificate Providers into Identity and Root Providers
We will also decouple the root provider and the identity provider in [the
tls\_credentials\_options](http://google3/third_party/grpc/src/core/credentials/transport/tls/grpc_tls_credentials_options.h;l=77;rcl=731487339)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Links to codesearch don't work in OSS. Please use github links instead.

```c++
class TlsCredentialsOptions {
public:
[[deprecated(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put a comment here saying that this is the existing method, which is now marked as deprecated.

Also, please add a comment before the last two methods indicating that those are the new methods being added in this design.

| Name | Disposition | Description |
| Label Name | Disposition | Description |
| :---- | :---- | :---- |
| `grpc.target` | required | Indicates the target of the gRPC channel for which this handshake is occurring |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description for this label should note that it's the existing label defined in A66.

debugging failures.

| Name | Disposition | Description |
| Label Name | Disposition | Description |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding a sentence in its own paragraph before this table:

"""
The new metric will use the following labels:
"""

length. For example, `RsaPkcs1Sha256`. For more, see the `SignatureAlgorithm`
enum in the C section above.

| Metric | Type | Unit | Labels | Description |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding a sentence before this table:

"""
Here is the new metric we will add:
"""

| `grpc.status` | optional | The status code return for the private key sign function. From [A66] |

The `grpc.tls.private_key_sign_algorithm` label will contain both a name and key
length. For example, `RsaPkcs1Sha256`. For more, see the `SignatureAlgorithm`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want the values for this label to be the same in all languages. Is that going to be a problem if we define the values via a C++ enum?


| Metric | Type | Unit | Labels | Description |
| :---- | :---- | :---- | :---- | :---- |
| `grpc.security.handshaker.offloaded_private_key_sign_duration` | Histogram | float64/double s | `grpc.target, grpc.security.handshaker.offloaded_private_key_sign.algorithm, grpc.security.handshaker.offloaded_private_key_sign.status` | How long the offloaded private key signing took |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we haven't implemented this as part of this feature, how about we leave the metric out of this gRFC and include it in the forthcoming gRFC for all of the security-related metrics?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants