From HTTP to gRPC: Trials, tribulations and triumphs of porting our Python SDK

Published by Sam Lock on September 07, 2023

At Cerbos, we simplify your authorization by decoupling it from your code and providing a scalable, low latency PDP (policy decision point) that runs alongside your application. In short, we replace all of your sprawling, nested if/else statements with a simple request that looks a bit like:

“Can this principal do this action on this resource?”

Cerbos itself is a single Go binary which runs as a server, exposing its services via gRPC or HTTP APIs. We love protobufs and gRPC because they provide us with a highly efficient, structured and language-neutral way of serializing data and executing remote procedure calls. We love them so much, in fact, that our CTO even took some time to write this excellent piece about them.

We’ve built a number of SDKs to streamline communication with the PDPs. Up until recently (for reasons that only became clear upon embarking on this journey), the Python SDK was the only one which still used the HTTP API. This post describes the trials and tribulations I went through in order to port it across to the wonders of gRPC.

Why?

Originally, the task was a simple one: to add support for our Admin API to the Python SDK. Now, I’m not saying that I have a tendency to make mountains out of molehills, but what I will say is that I saw an opportunity for a “quick win”: port the SDK to gRPC and then (pretty much) gain the protoc-generated Admin API interface for free! Unfortunately, I also need to say that the definition of “quick” became quite skewed.

I think it’s worth stating that I stand by my decision to do this. Our protobuf definitions are freely available in a Buf registry, and we can use Buf’s wonderful tooling to effortlessly generate our Python classes and clients for us.

Alas, as is customary in software development, it wasn’t quite that easy.

Let’s get stuck in

So, I made myself a coffee, climbed up into my loft-office/lair, and set about making this tiny change.

To break or not to break (backwards compatibility)?

First things first: our SDK has been around long enough that it has a notable user base. I can’t break the client for existing users, and I don’t fancy bumping the major version and maintaining two separate branches. The current SDK defines custom classes which map almost 1:1 to the protoc-generated ones, but the gRPC service stub expects the exact generated types, so I can’t invisibly swap out the internals of our client class without some sort of translation layer to convert instances of the custom classes to the generated ones. That translation seemed doable, but also probably overkill. I thought to myself, “what’s the point?”; we have a perfectly functioning HTTP client and it’s already backwards compatible because I haven’t changed it yet 🙂. Fine, I’ll just build a separate one.

How thin is my wrapper?

When you run your protobuf definitions through the protoc compiler, it spits out a hell of a lot of useful stuff for you. This isn’t just the message type definitions, but also the client stub you use to talk to the gRPC server. So, do we even need a custom client? Well, yes, but only so we can pass custom configuration to the gRPC channel. Here’s a snippet from the client constructor:

        # Build the gRPC method config, which controls per-call behaviour
        # (timeouts, retries, wait-for-ready) on the channel.
        method_config: dict[str, Any] = {}

        if methods:
            method_config["name"] = methods

        if timeout_secs:
            method_config["timeout"] = f"{timeout_secs}s"

        if request_retries:
            # Retry transient failures with exponential backoff
            method_config["retryPolicy"] = {
                "maxAttempts": request_retries,
                "initialBackoff": "1s",
                "maxBackoff": "10s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            }

        if wait_for_ready:
            method_config["waitForReady"] = wait_for_ready

        # The service config is passed to the channel as a JSON-encoded option
        options = {
            "grpc.service_config": json.dumps({"methodConfig": [method_config]}),
        }

        if channel_options:
            options |= channel_options

        # grpc expects channel options as a sequence of (key, value) tuples.
        # `creds` (the TLS channel credentials) is built earlier in the constructor.
        opts = [(k, v) for k, v in options.items()]
        if tls_verify:
            self._channel = grpc.aio.secure_channel(
                host,
                credentials=creds,
                options=opts,
            )
        else:
            self._channel = grpc.aio.insecure_channel(host, options=opts)
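
For context, constructing the client then looks something like this (the class name and defaults below are purely illustrative, not the SDK’s actual public API; the point is that all of the channel tuning above is driven by plain constructor arguments):

# Hypothetical usage sketch: the class name and port are assumptions.
client = AsyncCerbosClient(
    "localhost:3593",       # the PDP's gRPC endpoint
    timeout_secs=2.0,       # becomes the per-method "timeout" above
    request_retries=3,      # enables the retryPolicy block
    wait_for_ready=True,
    tls_verify=False,       # insecure channel, e.g. for local development
)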

The method calls simply build the request instances from the passed-in protobuf types and hand them off to the generated client stub. Easy:

@handle_errors
async def check_resources(
    self,
    principal: engine_pb2.Principal,
    resources: List[request_pb2.CheckResourcesRequest.ResourceEntry],
    request_id: str | None = None,
    aux_data: request_pb2.AuxData | None = None,
) -> response_pb2.CheckResourcesResponse:
    """Check permissions for a list of resources

    Args:
        principal (engine_pb2.Principal): principal who is performing the action
        resources (List[request_pb2.CheckResourcesRequest.ResourceEntry]): list of resources to check permissions for
        request_id (None|str): request ID for the request (default None)
        aux_data (None|request_pb2.AuxData): auxiliary data for the request
    """

    req_id = _get_request_id(request_id)
    req = request_pb2.CheckResourcesRequest(
        request_id=req_id,
        principal=principal,
        resources=resources,
        aux_data=aux_data,
    )

    return await self._client.CheckResources(req)
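
To make the shape of a call concrete, here’s a rough usage sketch. The message and field names come from the Cerbos protobuf definitions; the client variable, the resource values and where the generated modules are imported from are illustrative (the latter is exactly the question tackled below):

# Assumes `engine_pb2` and `request_pb2` are the generated modules, imported
# from wherever they end up living, and `client` is the async wrapper above.
principal = engine_pb2.Principal(id="alice", roles=["user"])
entry = request_pb2.CheckResourcesRequest.ResourceEntry(
    resource=engine_pb2.Resource(kind="document", id="doc-1"),
    actions=["view", "edit"],
)

# Called from within an async context
resp = await client.check_resources(principal=principal, resources=[entry])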

The plot thickens

Cool, so we have our client structure, let’s furnish it with our generated code. As it stands, the project structure is fairly straightforward:

.
├── cerbos
│   ├── __init__.py
│   └── sdk
│       ├── __init__.py
│       ├── _async
│       │   └── client.py
│       ├── _sync
│       │   └── client.py
│       ├── client.py
│       └── model.py
├── dist
├── pdm.lock
├── pyproject.toml
├── tests
└── utils
    └── gen_unasync.py

We maintain our top level cerbos module within which we write our application logic. We define our custom classes in cerbos/sdk/model.py, and implement them in cerbos/sdk/_async/client.py. We use a library to auto-generate our synchronous code which magically appears in cerbos/sdk/_sync/client.py. These clients are imported and exposed conveniently from within cerbos/sdk/client.py. Great, all simple stuff. Moving on…

It seems like a bad idea to blur the lines between custom and generated code. We use buf generate to generate the code. Behind the scenes, it retrieves the protobuf definitions from our Buf repository, compiles them, and then spits out the equivalent Python modules. This is super easy, but we also don’t know what arbitrary module structure we might end up with, so to be on the safe side, let’s create a directory cerbos/proto/ and whack all the generated code in there. Then we end up with something that looks like this:

.
└── cerbos
    ├── __init__.py
    ├── proto
    │   ├── cerbos
    │   ├── google
    │   ├── protoc_gen_openapiv2
    │   └── validate
    └── sdk
        ├── __init__.py
        ├── _async
        │   └── client.py
        ├── _sync
        │   └── client.py
        ├── client.py
        ├── container.py
        └── model.py

Cool! Now we can just import our generated code in our client module with something like…

from cerbos.proto.cerbos.engine.v1 import engine_pb2

Right? Wrong!

Run that code and you’ll eventually hit an import error; it’ll find the top-level module, but further down the import tree things get a bit more messy. You see, cerbos.proto.cerbos.engine.v1.engine_pb2 is exactly where we expect it to be, but if you load up that module and look at its imports, they look like this:

from cerbos.effect.v1 import effect_pb2 as _effect_pb2

Annoyingly, we know that module is actually located here:

from cerbos.proto.cerbos.effect.v1 import effect_pb2 as _effect_pb2

In hindsight, this makes sense. How could the protoc compiler have any knowledge of the target package structure? Instead, it copies the protobuf package structure and recreates it verbatim, imports and all. OK, not a problem. Surely protoc supports an argument to use relative imports rather than absolute ones? Consider this: if, inside the generated module, we swapped:

from cerbos.effect.v1 import effect_pb2

For:

from ...effect.v1 import effect_pb2

It would start working!

Well, as it turns out, no such option exists, and I’m not the first person to run into this problem. In fact, there have been some pretty in-depth discussions on this topic, and the maintainers at Google have explained the reasoning behind the implementation. In short, protobuf operates in a single, flat namespace. Google developed it for their own use case: to parallelize compilation across the many thousands of .proto files they maintain, rather than relying on single, massive protoc runs. Enabling relative imports would, in some cases, break the guarantee that two parallel compilation runs on two proto files generate the same output as a single run on both.

It seems that there’s a fairly strong case for a non-default option to allow relative imports, but the team remains reluctant to implement it, for the very reason mentioned above. As it stands, the discussion is still ongoing and users are still nudging for the feature, but it’s unlikely to change any time soon. Let’s look for some alternatives.

Python path gymnastics 🤸

The idea of mangling the Python path in order to influence import resolution generally gives me the heebie-jeebies, but let’s entertain it for a second: theoretically, we could create a directory to house our generated code (e.g. the cerbos/proto/ directory mentioned above), add that directory to the Python path, and the interpreter would also look there when resolving imports. I didn’t spend too long trying this, because I quickly hit namespace collisions, panicked, and binned the whole approach.
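
For illustration, the idea was roughly this (a sketch only; we never shipped it):

# Sketch of the path-mangling idea: make cerbos/proto/ an import root so that
# the generated code's absolute imports (e.g. `from cerbos.effect.v1 import ...`)
# resolve against the generated tree.
import sys
from pathlib import Path

_GENERATED_ROOT = Path(__file__).parent / "proto"
sys.path.insert(0, str(_GENERATED_ROOT))

The trouble is that this puts two different cerbos packages on the path (our own and the generated one), and only one of them can win, which is roughly where those collisions came from.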

The good ol' sed switcheroo

Another approach is to run the compilation and then post-process the import paths in the generated modules to make them relative. This was slightly less scary to me than the PYTHONPATH approach above, but still: arbitrary string manipulation makes me nervous, and the idea of an additional post-compilation build step seemed a little silly. Plus, we have a fairly deeply nested module hierarchy with lots of dots, so I binned this approach as well.
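
To give a flavour of what that post-processing step might have looked like, here’s a rough sketch (not something we shipped; note that the generated code also imports the installed google.protobuf package, which must be left untouched, and that’s exactly the sort of edge case that made me nervous):

# Rewrite absolute imports of *generated* packages into relative imports
# anchored at the generated root. Package names and paths are illustrative;
# `google.*` is deliberately excluded because google.protobuf comes from the
# installed protobuf library rather than our generated code.
import re
from pathlib import Path

GENERATED_ROOT = Path("cerbos/proto")
GENERATED_PACKAGES = ("cerbos", "validate", "protoc_gen_openapiv2")

pattern = re.compile(rf"^from ({'|'.join(GENERATED_PACKAGES)})\.", re.MULTILINE)

for py_file in GENERATED_ROOT.rglob("*.py"):
    # one leading dot per path component climbs back up to GENERATED_ROOT
    dots = "." * len(py_file.relative_to(GENERATED_ROOT).parts)
    rewritten = pattern.sub(rf"from {dots}\1.", py_file.read_text())
    py_file.write_text(rewritten)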

It’s worth mentioning that some very helpful people out there have built a tool for this very approach: Protoletariat. It’s really great, and something I strongly considered. However, I decided against it purely in the interest of avoiding an additional build dependency.

Surrender to the Google ways

Eventually, I sat with this problem long enough that I started to question my motives for maintaining the original package structure. I originally wanted a clean boundary between the generated code and our custom-built convenience wrapper (and I admit, there are still benefits to that approach, which I’ll discuss below 👇), but in essence, the generated code is all clean, publicly exposed Python.

The custom client we provide, as mentioned, is really just a thin wrapper. The users already had to import certain generated classes in order to interact with our client, and if they wanted to write their own, they’d have all the classes available to do so.

So this is what I did: I ran the compilation step, directed the output to the root of the package, and hung up my keyboard for the day.
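
With the generated code sitting at the package root, the absolute imports baked into the generated modules simply resolve, and our own client code imports them the same way, for example:

from cerbos.engine.v1 import engine_pb2
from cerbos.request.v1 import request_pb2
from cerbos.response.v1 import response_pb2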

What ifs

I’ll admit, I’m still not 100% satisfied with this approach. And I shall tell you for why.

Namespace collisions

Our project is called cerbos. Our Python package name is cerbos. Our protobuf package name is also cerbos. This all seems quite reasonable to me, from a design standpoint.

When you compile the protobuf definitions into the root of the package, the top-level cerbos/ directory that houses our custom code is forced to merge with all the generated code that shares the same top-level name. This isn’t too problematic if you know where to look, but if you’re a new maintainer coming into the project, it could all seem a little intimidating.
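
To picture it, the merged package root ends up looking something like this (generated sub-packages abbreviated):

.
└── cerbos
    ├── __init__.py
    ├── effect
    │   └── v1
    ├── engine
    │   └── v1
    ├── request
    │   └── v1
    ├── response
    │   └── v1
    └── sdk
        ├── _async
        ├── _sync
        ├── client.py
        └── model.py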

What’s more, I’ve yet to figure out how to cleanly purge orphaned modules. If a top-level proto directory used to exist in our Buf registry but no longer does, our repository has no knowledge of that deletion, so the generated code will sit, stale, in our repository until someone looks deep enough to realise that it’s no longer needed.

A port well worth the voyage?

Contrary to my initial expectations, porting our Python SDK from HTTP to gRPC was far from a simple task. The journey was filled with trials, tribulations, unexpected twists, and moments that led me to question the very nature of Python imports. But as I reflect on the process, I believe that the triumphs overshadow the challenges.

By harnessing the potent capabilities of gRPC, we've gained enhanced efficiency and adaptability. Though arduous, the journey yielded invaluable insights, spurred philosophical musings about package structures, and provided a glimpse into Google's methodologies.

Is the result flawless? Not quite. Puzzles like namespace collisions and lingering orphaned modules remain, adding intrigue more than detracting from our success. For now, these peculiarities lend character to the project. The SDK has emerged more robust and versatile. The odyssey of porting it transformed what seemed a "quick win" into a rewarding exploration, enriching our appreciation for the technology and tools we use.

That said, if any of you know how to solve some of the outstanding puzzles, then please head on over to our Slack community and let us know!
