Designing service discovery and load balancing in a microservices architecture

Published by Emre Baran on November 25, 2024

Transitioning from a monolithic architecture to microservices is an intricate, time-consuming task. It demands both strategic foresight and meticulous execution.

In this 10-part series, we’ll guide you through the most common challenges faced during monolith-to-microservices migration. Last week we published the third part of our series, on how to pick the right inter-service communication pattern for your microservices. This week, we are going to dive into service discovery and load balancing in microservices architecture.

We'll publish a new article every Monday, so stay tuned. You can also download the full 10-part series to guide your monolith-to-microservices migration.

Service discovery brief intro

Static service discovery mechanisms, such as hardcoded service locations or statically configured load balancers, break down in a microservices architecture, where instances start, stop, and move all the time.

Service discovery mechanisms allow microservices to locate and connect to the appropriate instances of other services that they need to communicate with. When static mechanisms are tasked with locating and connecting instances across microservices, they become overwhelmingly complex and difficult to manage. And because they were built for communication within a monolithic environment, they encourage tight coupling across your architecture, creating a ‘distributed monolith’, as discussed in the first article of this series.

Dynamic service discovery mechanisms, like service registries and service meshes, are better solutions for microservices architecture. They allow your distributed system to stay flexible through loose coupling because locations don’t need to be hardcoded for services to discover each other. They also build in resilience with the ability to reroute traffic around failed instances.

We’re going to dive into how each of these service discovery approaches works so you can decide on the best option for your architecture.

What is a service registry?

A service registry is a centralized database that maintains the instances and network locations of all available services and makes them available for application-level communication. Services first register with the registry to signal they are available, then query it to find any instances they need.

How does a service registry work?

Services register themselves when they come online, making them discoverable.
When a service needs to connect to another service, it queries the registry for the network location of a suitable instance.
As services come and go, the registry is updated to reflect the changing environment, so that clients only receive the locations of healthy instances.
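
The three steps above can be sketched as a toy in-memory registry. This is only an illustration under simplifying assumptions: the `ServiceRegistry` and `Instance` names are made up for this example, and a real registry would be a networked, replicated store rather than a Python dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One running copy of a service, identified by its network location."""
    host: str
    port: int


class ServiceRegistry:
    """Toy in-memory registry mapping service names to live instances."""

    def __init__(self) -> None:
        self._instances: dict[str, set[Instance]] = {}

    def register(self, service: str, instance: Instance) -> None:
        # Called by an instance when it comes online.
        self._instances.setdefault(service, set()).add(instance)

    def deregister(self, service: str, instance: Instance) -> None:
        # Called when an instance shuts down or is found to be unhealthy.
        self._instances.get(service, set()).discard(instance)

    def lookup(self, service: str) -> list[Instance]:
        # Called by a client that needs to reach `service`.
        found = self._instances.get(service, set())
        return sorted(found, key=lambda i: (i.host, i.port))


registry = ServiceRegistry()
registry.register("orders", Instance("10.0.0.5", 8080))
registry.register("orders", Instance("10.0.0.6", 8080))
registry.deregister("orders", Instance("10.0.0.5", 8080))
remaining = registry.lookup("orders")  # only the 10.0.0.6 instance is left
```

The client never hardcodes an address; it only knows the service name "orders" and asks the registry for whatever instances are currently available.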

Key features of a service registry

Registration and deregistration: When a service instance starts, it registers its network location (IP address and port) with the service registry. When an instance shuts down, it deregisters itself. If it becomes unhealthy (i.e. non-responsive), the registry removes it so that other services don’t attempt to connect to it.

Discovery: Service registries allow for dynamic discovery of service endpoints without hardcoding network locations. Instead of relying on a fixed network location where instances can be found, services query the registry to find the instances of another service.

Load balancing: Most service registries work with load balancers, which split traffic across multiple service instances. This helps enhance both performance and fault tolerance.

Health checks: Many service registries include built-in health check mechanisms, typically based on a health check URL exposed by each instance. The registry periodically verifies the health of registered service instances by querying the URL. If an unhealthy instance is found, it’s deregistered.
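
As a rough sketch of the health check idea, the function below filters a table of registered instances using a probe. The `sweep` function, the URLs, and the stand-in probe are all assumptions made for this example. In a real registry the probe would be an HTTP request with a timeout, run on a schedule.

```python
from typing import Callable


def sweep(
    instances: dict[str, list[str]],
    probe: Callable[[str], bool],
) -> dict[str, list[str]]:
    """Return a copy of the registry table with unhealthy instances removed.

    `instances` maps a service name to the health check URLs of its
    instances. `probe` returns True when a URL answers successfully.
    """
    return {
        service: [url for url in urls if probe(url)]
        for service, urls in instances.items()
    }


# Stand-in probe for illustration: pretend one instance stopped responding.
down = {"http://10.0.0.6:8080/health"}
table = {
    "orders": [
        "http://10.0.0.5:8080/health",
        "http://10.0.0.6:8080/health",
    ]
}
healthy = sweep(table, probe=lambda url: url not in down)
```

Passing the probe in as a parameter keeps the sketch runnable without a network and makes the deregistration rule easy to test.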

Metadata storage: Aside from registering services for discovery, registries can also store metadata about registered services. This data can include version numbers, configuration settings, and custom attributes, which can be used to route requests based on specific criteria or to provide detailed information for monitoring and debugging.
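
To make metadata-based routing concrete, here is a minimal sketch that narrows a list of instances by their stored attributes. The record layout (`address` and `meta` keys) and the `select` helper are hypothetical, chosen only for this example.

```python
def select(instances: list[dict], **required: str) -> list[dict]:
    """Keep instances whose metadata matches every required key/value pair."""
    return [
        inst
        for inst in instances
        if all(inst.get("meta", {}).get(k) == v for k, v in required.items())
    ]


orders = [
    {"address": "10.0.0.5:8080", "meta": {"version": "1.4", "region": "eu"}},
    {"address": "10.0.0.6:8080", "meta": {"version": "2.0", "region": "eu"}},
]
v2_only = select(orders, version="2.0")
```

A client could use a filter like this to send requests only to instances running a particular version, or to prefer instances in its own region.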

Commonly, service registries can be implemented using network-accessible key-value stores. Now let’s talk about examples of service registries:

A distributed coordination service, Apache ZooKeeper provides robust features for service discovery and configuration management.
Consul is a popular HashiCorp tool which offers service discovery, health checking, and a distributed key-value store.
Developed by CoreOS (now part of Red Hat), etcd is a distributed key-value store. It’s often used for storing configuration data and service discovery information.

What is a service mesh?

A service mesh takes service discovery a step further by implementing a dedicated infrastructure layer for service-to-service communication. This frees individual services from having to implement the complex functionalities of service discovery.

How does a service mesh work?

Typically, service meshes are implemented as a network of lightweight proxies (e.g. sidecars) deployed alongside each service instance. These proxies handle service discovery, load balancing, encryption, and other cross-cutting concerns.

Key features of a service mesh

Traffic management: A service mesh can help manage traffic across your distributed system through a combination of load balancing, traffic shaping and splitting, and routing. It prevents bottlenecks in the system by distributing incoming requests across multiple service instances.

Service meshes also control traffic between services through rate limiting and throttling to shape traffic and smooth out spikes. In the same way, service meshes can split traffic between service versions, allowing for canary releases, A/B testing, and gradual rollouts of new features. Finally, service meshes typically offer advanced routing capabilities including request-based routing, header-based routing, and URL path-based routing. This allows for granular control over how traffic is directed between services.
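
Traffic splitting and header-based routing can be sketched in a few lines. This is an illustration only: the `pick_version` function, the `x-canary` header, and the `v1`/`v2` labels are assumptions for this example, and real meshes express the same idea through their own routing configuration rather than application code.

```python
import hashlib


def pick_version(request_id: str, headers: dict[str, str], canary_percent: int) -> str:
    """Choose which version of a service should receive a request."""
    # Header-based routing: an explicit opt-in header always wins.
    if headers.get("x-canary") == "true":
        return "v2"
    # Weighted split: hash the request ID into a bucket from 0 to 99 so
    # that the same ID always lands on the same version.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "v2" if bucket < canary_percent else "v1"


# Send roughly 10% of requests to the new version.
choice = pick_version("req-12345", {}, canary_percent=10)
```

Hashing rather than picking at random keeps the decision stable for a given request ID, which makes a gradual rollout easier to reason about and debug.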

Security: Service meshes provide several measures that help your team build a secure microservices architecture. Mutual TLS (mTLS) secures communication between services by authenticating both endpoints and encrypting the traffic between them. Meshes also support authentication and authorization by offering fine-grained access control policies.

Properly set up, these policies ensure that only authorized services can access specific services and endpoints. The ability to manage service identities and certificates simplifies the implementation of secure service-to-service communication.
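
A fine-grained service-to-service policy can be thought of as a deny-by-default allow list. The sketch below is a simplified illustration: the rule format, the service names, and the `is_allowed` helper are invented for this example. In a mesh, the caller identity would typically come from the certificate presented during the mTLS handshake rather than from a plain string.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: (caller identity, target service, HTTP method, path pattern)
ALLOW_RULES = [
    ("checkout", "payments", "POST", "/charges"),
    ("checkout", "inventory", "GET", "/items/*"),
]


def is_allowed(caller: str, target: str, method: str, path: str) -> bool:
    """Deny by default; allow only calls that match an explicit rule."""
    return any(
        caller == c and target == t and method == m and fnmatchcase(path, p)
        for c, t, m, p in ALLOW_RULES
    )


ok = is_allowed("checkout", "inventory", "GET", "/items/42")
blocked = is_allowed("search", "payments", "POST", "/charges")
```

Here `checkout` may read inventory items, while `search` is refused access to `payments` because no rule names it.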

Observability: Typically, meshes automatically collect metrics related to service communication, including the four golden signals (traffic, latencies, saturation, and error rates). This gives your team insights into the health and performance of services. They also integrate with distributed tracing systems (which we’ll cover in our next article) and gather and aggregate logs of service interactions. This increased visibility lets developers effectively monitor and troubleshoot systems to enhance performance.

Resilience: A service mesh offers greater flexibility than static systems, building resilience into your architecture. One of the key tools meshes have is the ability to implement circuit breakers, which stop sending requests to a failing service until it has had time to recover. Teams are also able to configure automatic retries and timeouts to handle transient failures and ensure timely responses. A mesh also allows your team to control the rate of incoming requests to prevent services from getting overwhelmed.
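
The circuit breaker pattern is easier to follow in code. The class below is a minimal sketch, not how any particular mesh implements it: the `CircuitBreaker` name, its thresholds, and the injectable `clock` parameter are assumptions for this example. In a mesh the equivalent logic runs in the proxy, so application code does not need to contain it.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to forward a call."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit and restarts the cool-down.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
greeting = breaker.call(lambda: "hello")  # healthy call passes straight through
```

While the circuit is open, callers get an immediate error instead of waiting on a service that is known to be failing, which gives that service room to recover.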

Policy enforcement: Policy support is built into a service mesh, allowing your team to define and enforce policies for traffic management, security, and resource usage. Your team can also control access with role-based access control (RBAC). This allows you to regulate who can perform specific actions and on which services.

Service discovery and registry: Dynamic service discovery automatically discovers services and their instances. This keeps the mesh up to date and able to route traffic to the right instance. The service registration of a mesh allows dynamic scaling and seamless service discovery, even in a dynamic microservices architecture.

Integration and extensibility: Service meshes integrate readily with existing observability, security, and management tools. This gives your team the ability to add a service mesh to existing infrastructure. They also offer extensible frameworks that allow operators to add custom plugins and extend their capabilities to meet specific requirements.

Popular service meshes to check out:

A widely used service mesh, Istio offers robust traffic management, security, and observability features.
Linkerd is a lightweight service mesh focused on simplicity, performance, and security.
Part of HashiCorp Consul, Consul Connect provides service discovery, configuration, and secure service-to-service communication.

Load balancing in complex microservices architectures

Both service registries and service meshes have built-in load balancing capabilities to distribute incoming traffic across multiple instances of a service. This improves resource utilization, supports high availability, and generally improves performance. Here's a helpful explanation of load balancing from Software Engineering Stack Exchange:

“Load balancing a service allows clients to be decoupled from the scalability of those other services. All clients have a single URL to interact with. Cloud environments have automated tools that can add and remove nodes behind a load balancer. This helps enable the scalability promised with micro services.

If load balancing decouples clients from the scalability of a service, then service discovery decouples clients from knowing which URLs can be used to communicate with the other services. Think of service discovery as an index of all the microservices in your ecosystem. The meta data about each service should return the URL of the load balancer in front of a service.”

There are supplementary tools you can use to increase the resiliency of your system and improve performance. NGINX, HAProxy, and cloud-based load balancers (e.g. AWS Elastic Load Balancing, Google Cloud Load Balancing) help distribute traffic across service instances based on various algorithms like round-robin, least connections, or weighted distribution.
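
Two of the algorithms named above can be sketched as follows. These classes are simplified illustrations with made-up names, not the implementations used by NGINX, HAProxy, or any cloud load balancer.

```python
import itertools


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def acquire(self):
        # Ties go to the backend that was listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call this when the request to `backend` has finished.
        self.active[backend] -= 1


rr = RoundRobin(["app-1", "app-2"])
rotation = [rr.pick() for _ in range(4)]
```

A simple form of weighted distribution can be approximated with the same rotation by listing a larger backend more than once, for example `RoundRobin(["big", "big", "small"])`.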

These tools can implement load balancing at different levels, such as the network level (L4) or the application level (L7).

L4 load balancing operates at the transport layer and distributes traffic based on IP addresses and ports.
L7 load balancing operates at the application layer and can make routing decisions based on the content of the request, such as URLs or headers.
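
The difference between the two levels comes down to what information the balancer can see. The sketch below illustrates this with two hypothetical functions. The pool names, the `x-api-version` header, and the `/images/` path are assumptions for this example.

```python
import hashlib


def l4_pick(src_ip: str, src_port: int, backends: list[str]) -> str:
    """Transport-level choice: only addresses and ports are visible."""
    key = f"{src_ip}:{src_port}".encode()
    index = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(backends)
    return backends[index]


def l7_pick(path: str, headers: dict[str, str], pools: dict[str, list[str]]) -> list[str]:
    """Application-level choice: the request's path and headers are visible."""
    if headers.get("x-api-version") == "2":
        return pools["api-v2"]
    if path.startswith("/images/"):
        return pools["static"]
    return pools["default"]


pools = {
    "api-v2": ["10.0.2.1:8080"],
    "static": ["10.0.3.1:8080"],
    "default": ["10.0.1.1:8080", "10.0.1.2:8080"],
}
image_pool = l7_pick("/images/logo.png", {}, pools)
backend = l4_pick("203.0.113.7", 51234, pools["default"])
```

The L4 function cannot tell an image request from an API call, so it can only spread connections. The L7 function can send each kind of request to a dedicated pool.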

Now that we've covered the technical concepts, let’s dive into an interesting example from Airbnb.

Airbnb builds in dynamic service discovery and load balancing with SmartStack

When Airbnb moved to microservices, their static service discovery solutions didn’t work anymore. So they developed their own framework, which they called SmartStack, to manage service discovery, service registration, and load balancing in their complex microservices architecture.

Supported by observability and monitoring tools, SmartStack allowed Airbnb to create an effective service discovery system in their microservices environment.

Registration, discovery and load balancing with SmartStack

Airbnb’s development team decided to split these three functions across two distinct components, which they called Nerve and Synapse.

Nerve is a service registration daemon that registers services with a distributed key-value store such as ZooKeeper. It periodically checks the health of the services and updates their registration status to maintain a clean record of all functioning instances.
Synapse is a service discovery and load-balancing component that acts as a transparent proxy for service communication. It subscribes to the key-value store to discover the available service instances and routes traffic to them based on configurable load-balancing algorithms.
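
The division of labour between the two components can be illustrated with a toy model. This is not SmartStack's code: the `Store`, `registrar_tick`, and `Router` names are invented for this example, a plain dictionary stands in for the distributed key-value store, and the health check is a stand-in function.

```python
import itertools


class Store:
    """Stand-in for the shared key-value store (a plain dict here)."""

    def __init__(self):
        self.entries: dict[str, set[str]] = {}


def registrar_tick(store, service, address, is_healthy):
    """Nerve-like step: publish the instance while healthy, withdraw it otherwise."""
    members = store.entries.setdefault(service, set())
    if is_healthy():
        members.add(address)
    else:
        members.discard(address)


class Router:
    """Synapse-like step: read the store and rotate across listed instances."""

    def __init__(self, store, service):
        self.store, self.service = store, service
        self._counter = itertools.count()

    def pick(self):
        backends = sorted(self.store.entries.get(self.service, set()))
        if not backends:
            raise LookupError(f"no healthy instances of {self.service}")
        return backends[next(self._counter) % len(backends)]


store = Store()
registrar_tick(store, "search", "10.0.1.1:9000", lambda: True)
registrar_tick(store, "search", "10.0.1.2:9000", lambda: True)
router = Router(store, "search")
first, second = router.pick(), router.pick()
# The first instance now fails its health check and is withdrawn.
registrar_tick(store, "search", "10.0.1.1:9000", lambda: False)
third = router.pick()
```

Because the registrar and the router only share the store, neither needs to know about the other, and the calling service never sees an instance that has been withdrawn.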

These components integrate with Airbnb's infrastructure automation and deployment tools. They automatically register services as they are deployed to ensure all healthy service instances are discoverable. SmartStack also provides automatic failover, so if an instance fails its health checks, traffic is re-routed to the remaining healthy instances.

Integrating flexibility & resilience

SmartStack allowed services to discover and communicate without the need for manual configuration or hardcoded service locations, freeing the team to scale their services independently. Load balancing and automatic failover built resilience into the system by optimizing resource allocation (without overwhelming instances) and ensuring failures were handled gracefully.

Real-time insight

Airbnb integrated their observability and monitoring tools, such as Datadog and Grafana, with SmartStack to gain real-time visibility into the health and performance of their services. This gave them a better understanding of service dependencies, traffic patterns, and potential bottlenecks. Armed with this in-depth knowledge, Airbnb proactively identified and resolved issues before they caused chaos in the system.

With SmartStack, Airbnb created a service discovery system that could successfully manage the complexities of their microservices architecture. It improved reliability, enabled a structure that could scale independently and, in the end, delivered a seamless user experience to millions of users worldwide.

Looking ahead

Ready for the next in the series? Continue to "Monitoring and observability in microservices".

Or, you can download the complete 10-part series in one e-book now: "Monolith to microservices migration: 10 critical challenges to consider".
