This blog post is adapted from a talk originally presented at DevOpsDays Cairo 2024.
Hello! I'm Dan, and I'm currently working as a DevRel Engineer at Cerbos. I've had the privilege of enjoying a fairly lengthy career in technology, including roles at organizations like Scaleway, Datadog, Mozilla, and Ubisoft. I mention these things to provide context for the perspectives I'll be sharing today about stateless architecture.
Before we dive into our main topics—core principles, advantages and disadvantages, and practical concerns—there's something fundamental we need to address first: what exactly is state?
State refers to any information that a system or application needs to retain between different requests or interactions to understand and respond correctly to subsequent requests from the same user or process. The key word here is "retain"—it's the retained information that is necessary for a given interaction to be successful.
State can take many forms, but some common examples include user session data (like the contents of a shopping cart), authentication tokens, user preferences, in-memory application variables, and records persisted in a database.
State is basically everything, everywhere, all at once. So how could anything be stateless?
That's a fair question. We could also ask: is anything real? What is the nature of reality? In fact, there's a whole field of philosophy dedicated to understanding the nature, origin, and limits of knowledge: epistemology. But that's not what we're here today to discuss. 😅
To be clear, yes, stateless is a real thing, but in much the same way that serverless still involves tons of servers, stateless still involves state—it just deals with that state in very specific ways.
There are five core principles, or pillars, of stateless design.
Each request from a client to a server must be self-contained. The client sends everything the server needs, and the server processes it and sends back everything the client expects. The server doesn't rely on knowledge from any previous interactions to handle the current request.
This means that every request has all the necessary information for that interaction to be successful; in other words, 100% of the necessary state is contained within the single interaction. Remember when I said stateless still involves state? This is what I meant.
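To make this concrete, here's a minimal sketch (the handler and field names are hypothetical) of a self-contained request: the response is derived entirely from the request payload, with no server-side memory consulted between calls.

```python
def handle_request(request: dict) -> dict:
    """Stateless handler: the response is a pure function of the request.

    Everything needed -- who the user is, what they want, any context --
    arrives inside `request` itself. No server-side session is consulted.
    """
    user = request["user"]            # identity travels with the request
    items = request.get("items", [])  # so does the working data
    total = sum(item["price"] * item["qty"] for item in items)
    return {"user": user, "total": total}

# Two identical requests yield identical responses, on any node:
req = {"user": "alice", "items": [{"price": 5, "qty": 2}, {"price": 3, "qty": 1}]}
print(handle_request(req))  # {'user': 'alice', 'total': 13}
```

Because the handler reads nothing outside its arguments, it can run on any node, or a brand-new node, with the same result.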
This principle is logically derived from the first, but it stands as a separate pillar because it's such a fundamental concept. Any concept of continuity, such as a user session or a series of logically-related transactions, has to be managed outside of the transaction itself. Again, stateless doesn't mean that state doesn't exist—it means that we have to think about it in different ways.
Idempotence is a concept from mathematics and computer science. It refers to a property of certain operations whereby the operation can be applied multiple times without changing the result beyond the initial application. You see this in abstract algebra, and thus in functional programming as well.
But if you're not a math nerd, an easier example of idempotency would be a light switch.
If I press it to "on", the light turns on, and if I keep pressing it to "on", it stays on. In other words, I can perform the operation as many times as I want, but the result won't change. You also see this a lot in configuration management—in fact, the first place I encountered this concept outside of math class was in the early days of Puppet; but I digress…
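The light switch translates directly into code. In this sketch, `set_light` (declare the state you want) is idempotent, while `toggle_light` (flip whatever is there) is not:

```python
def set_light(state: dict, value: str) -> dict:
    """Idempotent: setting the light 'on' twice is the same as setting it once."""
    return {**state, "light": value}

def toggle_light(state: dict) -> dict:
    """NOT idempotent: each call flips the result."""
    return {**state, "light": "off" if state["light"] == "on" else "on"}

state = {"light": "off"}
once = set_light(state, "on")
twice = set_light(set_light(state, "on"), "on")
assert once == twice  # repeating the operation changes nothing

flipped = toggle_light(toggle_light(state))
assert flipped == state  # two toggles cancel out, so blind retries are unsafe
```

This is exactly the distinction configuration management tools lean on: you declare the desired end state rather than issuing a sequence of mutations.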
This is the standard term—but when we say "decoupled", we should probably be saying "loosely coupled"; however, naming things is hard (see: DevOps). Anyway, consider a traditional architecture where there's shared session data—or, better yet, shared memory, namespaces, direct dependencies, and so forth. Stateless doesn't have that.
Instead, whatever collection of services you have—databases, caches, and so on—has to communicate through defined (hopefully well-defined) interfaces. The most obvious example here is an HTTP API, but there are others. For those of you thinking to yourselves, "hmm, this sounds like microservices", the answer is yes. Yes it does. More on that later.
This is logically derived from the other principles, but like external state management, it stands on its own because it's such a fundamental aspect of stateless architecture. Decentralization is the watchword here.
Remember idempotency, where the same request gives the same result every time—this has to be true across an arbitrary number of source and destination pairings. In stateless architecture, you spread the load across as many handlers (nodes, instances, endpoints, whatever) as possible. This has implications not just for scalability, but for design. We have to design our systems with this in mind. And, yes, your cloud provider will love all the new instances they get to sell you. 😉
Now, if you're thinking, "I get all of these individual points, but not how they fit together", don't worry—we're going to go through all those points again. But this time through, I'm going to connect the dots between each of these pieces.
Let's start with independent requests. On the upside, since each request is self-contained, the server doesn't need to remember anything about the previous requests. This makes the system more resilient to failure because no single request depends on any other. If a node disappears, it's fine, because thanks to idempotency, even if the transaction is unresolved, you can just try again.
However, there are some trade-offs. Since each request is self-contained, payloads can get really big, which can definitely be an issue if network capacity is a constraint. Also, relatively straightforward failure modes can result in thundering herds of (potentially massive) payloads, with all the attendant bandwidth and processing overhead concerns. Suffice it to say, this is an area where retry strategies become really important.
State management is effectively offloaded to the client, or to an external system like a database, or a combination of both. This means that the processing node (i.e., the server) only needs to deal with the processing of the request itself, and is thus totally ephemeral. In general, this means that any given node can be relatively lightweight from a resource perspective, and it also means you can cycle nodes whenever you like, which makes crusty old sysadmins like me very happy.
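As a sketch of what externalized state looks like, here a plain dict stands in for an external store such as Redis or a database. The handler itself holds nothing between calls; any node with access to the store could serve any request:

```python
# A plain dict stands in for an external store (Redis, a database, etc.).
SESSION_STORE: dict[str, dict] = {}

def handle(request: dict) -> dict:
    """The node keeps nothing between calls: session data lives in the
    external store, keyed by an ID the client sends with each request."""
    session_id = request["session_id"]
    session = SESSION_STORE.setdefault(session_id, {"visits": 0})
    session["visits"] += 1
    return {"session_id": session_id, "visits": session["visits"]}

print(handle({"session_id": "abc"}))  # {'session_id': 'abc', 'visits': 1}
print(handle({"session_id": "abc"}))  # {'session_id': 'abc', 'visits': 2}
```

Swap the dict for a networked store and the handler's process becomes fully disposable, which is the whole point.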
On the downside, you're now storing state in places that might be unfamiliar to you. Authentication tokens, user settings, context, just sitting around on the client side… that's a whole bunch of new and exciting security issues to consider (fun!).
Idempotency is a really neat property. If the network fails, or a transaction is left unresolved, or whatever, it's not a problem. Just try again. Out of order? Try again. Missing packet? Try again. It doesn't matter. Any error modes that result in unresolved states are, mechanically, much easier to recover from. From a reliability perspective it's extremely powerful.
On the downside, it's a design constraint—and one that can be challenging for teams that aren't used to working with that constraint. There are so many side effects of non-idempotent system design that most of us just take for granted, so learning to work within the model can be a journey of discovery.
Decoupling strongly implies modularity. Each component can—indeed, must—be developed, deployed, scaled, and managed independently. This can speed up development cycles, which is great, and reduces the risk of a failure in one module taking the whole system down. Large companies with lots of product teams tend to gravitate this way naturally, so organizationally it's a good fit. And, as I mentioned earlier, this is the basis for microservices. Stateless architecture and microservice architecture work perfectly together!
On the downside, you're now introducing a whole coordination layer: APIs, event-driven triggers, messaging systems, and so forth. There are implications here for versioning, for testing, for deploying, for decommissioning, and so forth. It's not any more or less complicated per se, but it is shifting the complexity into different places. Also—and this is important—latency and bandwidth are now very much an issue. If your system requires speedy, reliable networking in order to function, then you need speedy, reliable networking. Ergo, network bottlenecks are absolutely a concern.
Finally, we get to horizontal scaling. The upside is really simple: if you have a resource problem, literally throwing more machines at that problem will probably make it go away. Is your problem traffic spikes? More machines. Suddenly increased workloads? More machines. Backplanes, sockets, and undocumented cloud provider usage caps? Throw more machines at it! Problem solved!
What's more, you more or less get automatic load balancing out of the deal. And, of course, if a node fails, who cares? Spin another one up and try again—it's idempotent, remember?
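To illustrate why node failure doesn't matter, here's a toy round-robin dispatcher over interchangeable nodes (all names are hypothetical). Which node answers varies; the result never does:

```python
from itertools import cycle

def make_node(name: str):
    """Every node runs the same pure handler, so any node can serve any request."""
    def node(request: dict) -> dict:
        return {"served_by": name, "result": request["x"] * 2}
    return node

nodes = cycle([make_node("node-a"), make_node("node-b"), make_node("node-c")])

# Round-robin dispatch: the answering node rotates, the result is constant.
results = [next(nodes)({"x": 21})["result"] for _ in range(6)]
assert results == [42] * 6
```

If "node-b" vanished, you'd simply remove it from the rotation and retry; idempotency guarantees the replayed request is harmless.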
On the downside, congratulations: you are now the proud owner of a distributed system, with all the fun and interesting side effects and emergent behaviors that come with owning a distributed system. I hope you have a good SRE team. 🛠️
And speaking of a good SRE team, let's talk about the practical concerns of managing a stateless stack. Spoiler alert: like everything in the world of technology, it's trade-offs all the way down. I'd like to address three concerns that come up pretty regularly when you're working with stateless architecture: user sessions, caching mechanisms, and deployment lifecycle management. That last one is a biggie.
I may have mentioned once or twice that stateless applications don't store state on the server. As you might imagine, this has implications for handling user sessions. The short answer is that session data has to be externalized, which means either external services—or, more likely, client-side tokens. JWT (pronounced "jot"), or JSON Web Tokens, are more or less the de facto way of handling this.
Super briefly, a JWT is a URL-safe method for representing claims between two parties. Basically, it's token data that can ensure identity across multiple systems without needing to exchange credentials every time. On the positive side, JWTs are stateless by design, compact, and offer cryptographically secure signatures. On the downside, they can be difficult to revoke, and the payload is only base64url-encoded, not encrypted, so anyone who holds the token can read it.
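To see that readable-payload caveat in action, this sketch decodes a JWT's claims with nothing but the standard library, no key and no signature verification (the toy token below is unsigned and purely illustrative):

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode a JWT's payload WITHOUT verifying it. The payload is just
    base64url-encoded JSON: anyone holding the token can read it, which
    is why secrets must never go in the claims."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy (unsigned) header.payload.signature token to demonstrate:
claims = {"sub": "alice", "role": "admin"}
body = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
token = f"eyJhbGciOiJub25lIn0.{body}."
assert jwt_payload(token) == claims
```

Real tokens must of course be signature-checked before you trust the claims; the point here is only that reading them requires no secret at all.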
There's a lot to be said about JWT, and while it isn't the primary focus here, it's worth mentioning because it's so heavily implemented in stateless architecture. You'll definitely want to check it out if you're moving in this direction.
Moving on to one of the great problems of computer engineering: cache invalidation. The reality is that caching is important within the context of stateless architecture. Good caching is going to pay massive dividends in performance, especially with regards to network latency and overhead.
Where you cache is… less clear. You could go the traditional route of setting up Redis or Valkey or something—and that's fine, assuming you want to manage even more distributed systems. You could also cache at the edge—as in your CDN, for example. Depending on your use-case, you might also be able to use the client browser. There are a lot of options.
No matter how you do it, though, there is one universal truth: you need to test your caching mechanisms! The usual suspects for any sort of distributed store apply—load testing, for example—but don't forget the things that are really specific to caching itself. What happens when the cache misses? What's the performance impact there? And what happens when the cache is invalidated unexpectedly, which absolutely happens all the time? How consistent is your cache? What happens when your app encounters inconsistency? Test. It. All.
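For reference, here's a minimal cache-aside sketch (a hypothetical class, not a real library) that makes misses and TTL expiry explicit, so both paths are easy to exercise in tests:

```python
import time

class TTLCache:
    """Minimal cache-aside pattern: look in the cache first, fall back to
    the (slow) source on a miss, and expire entries after `ttl` seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.store: dict[str, tuple[float, object]] = {}
        self.misses = 0

    def get(self, key: str, load):
        entry = self.store.get(key)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]                       # hit: serve from cache
        self.misses += 1                          # miss: go to the source
        value = load(key)
        self.store[key] = (time.monotonic(), value)
        return value

cache = TTLCache(ttl=60)
assert cache.get("user:1", lambda k: "loaded") == "loaded"    # miss
assert cache.get("user:1", lambda k: "reloaded") == "loaded"  # hit
assert cache.misses == 1
```

Counting misses like this is a cheap way to assert in tests that your hot paths actually hit the cache.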
I'm going to end with the element of all of this that keeps me up at night, and that's deploying and managing long-lived stateless applications and stacks. Stateless is great, and it's absolutely the way forward for a lot of what we're doing as a distributed, security-forward, modern industry. But there are challenges, and we do need to address them. As before, a spoiler alert: everything I'm about to say is basically a laundry list of concerns for microservices in general.
The reality is that your stateless application is just one piece of a much larger series of interconnected parts that all need to function—and function they will, to greater or lesser degrees, and not always in ways you want. However much attention you think you need to pay to dealing with error conditions, retries, data consistency issues, and so forth—double it. Error handling is your new favorite pastime. 📊
This is even more true if you're deploying in different environments, like different cloud providers. Infrastructure as Code, robust CI/CD pipelines, and strong secret (or just variable) management are clutch here.
Whatever mechanisms you're using to store state need to be robust. When they fail—and they will fail—you need retries and fallback strategies to deal with that. Object storage API isn't responding? Your CDN isn't invalidating like you expect? Service failures are something you need to deal with.
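One simple fallback strategy is to try your stores in priority order and take the first answer. In this sketch the stores are stand-in callables that may raise, simulating a primary object store, a replica, a stale local cache, and so on:

```python
def read_with_fallback(key: str, stores: list) -> object:
    """Try each store in order; the first one that answers wins."""
    errors = []
    for store in stores:
        try:
            return store(key)
        except Exception as exc:  # broad catch is fine for a sketch
            errors.append(exc)
    raise RuntimeError(f"all {len(stores)} stores failed for {key!r}: {errors}")

def flaky_primary(key):  # simulates an outage
    raise TimeoutError("object storage API not responding")

def replica(key):
    return f"value-for-{key}"

assert read_with_fallback("config", [flaky_primary, replica]) == "value-for-config"
```

A production version would add per-store retries (with backoff) and timeouts, but the ordering-plus-fallback skeleton is the core of it.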
But before you can even think about those services, you need to consider how you become aware of them in the first place. Service discovery is way easier today than it was, say, 10 years ago—think etcd or Consul—but now you're managing even more distributed services. I know I already said this, but I hope you have a strong SRE game.
Microservices—er, I mean stateless applications—and load balancers go hand in hand. There are a lot of great load balancing solutions out there, and many of them have a really neat feature called session persistence. This is a great way to ensure that client requests are always routed to the server that is managing that client's session.
But in stateless, there is no session persistence—at least, not at this layer. Because any node can handle any request, you want to avoid sticky sessions altogether. Likewise, when it comes to autoscaling, don't overthink it. Really simple resource-based triggers are probably all you need.
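A resource-based trigger really can be that simple. This sketch implements the proportional rule that is (roughly) at the heart of Kubernetes' Horizontal Pod Autoscaler: scale the replica count by the ratio of observed to target utilization, clamped to a floor and a ceiling:

```python
import math

def desired_replicas(current: int, cpu_utilization: float,
                     target: float = 0.6, min_n: int = 2, max_n: int = 20) -> int:
    """Proportional autoscaling: if utilization is 1.5x the target,
    ask for 1.5x the replicas (rounded up), within [min_n, max_n]."""
    wanted = math.ceil(current * cpu_utilization / target)
    return max(min_n, min(max_n, wanted))

assert desired_replicas(4, cpu_utilization=0.9) == 6    # 4 * 0.9/0.6 = 6
assert desired_replicas(4, cpu_utilization=0.3) == 2    # scale down, floor at min
assert desired_replicas(10, cpu_utilization=2.0) == 20  # cap at max
```

Because any node can serve any request, this one number is the entire scaling decision; there's no session rebalancing to worry about.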
Now, I could go on all day about this stuff, because every one of these topics is a real rabbit hole; but we do have to wrap things up, so here's an attempt at a conclusion.
Stateless architecture is powerful stuff. Simplified server-side design, ready for load balancing and scaling without any issues, resilience and fault-tolerance for free, and here's something we didn't even get into: it's security-forward by design. However, there are trade-offs: data and network overhead, externalized state management, tokens and dependencies and all that stuff—it all needs to be dealt with.
…actually that's not really a conclusion, is it? Let me try again.
The real conclusion is this: take the time to really, and I mean really understand your use-case. Stateless—like serverless, like microservices, like Kubernetes and every other model and solution under the sun—they're all just tools. Tools for your toolbox. Don't be afraid to slow down and really learn about your tools. Choosing the best tool for the job can be surprisingly difficult, but if you get it right from the start, it makes the rest of your job much, much easier.