How to pick the right inter-service communication pattern for your microservices

Published by Emre Baran on November 18, 2024

Transitioning from a monolithic architecture to microservices is an intricate, time-consuming task. It demands both strategic foresight and meticulous execution.

In this 10-part series, we’ll guide you through the most common challenges faced during monolith to microservices migration. Last week we published the second part of our series, on how to better manage your data and provide consistency across your entire application. This week, we are going to examine inter-service communication.

Inter-service communication is an integral part of a successful microservices architecture. Handling all that communication efficiently—and avoiding adding excessive latency—is one of the major challenges of the microservices architecture.

We'll publish a new article every Monday, so stay tuned. You can also download the full 10-part series to guide your monolith-to-microservices migration.

Picking the right communication pattern for your microservices

Without seamless communication between microservices, both the functionality of the app and the experience of the end user suffer. To maintain a good user experience, you need to tailor inter-service communication to the demands of your system and make sure it can handle failure scenarios.

The first step to tailoring communication in your system is finding the right communication patterns.

Synchronous communication

Synchronous communication patterns, such as REST or gRPC, are simple request-response interactions. A service sends a request and then waits for a response from another service.

  • REST (Representational State Transfer) is a stateless architectural style typically used over HTTP. It’s highly scalable and widely supported.
  • gRPC (Google Remote Procedure Call) uses HTTP/2 for transport, providing features like bi-directional streaming and efficient binary serialization.

When multiple microservices synchronously communicate (like in the diagram below), they end up executing the interactions in series. This means the final response must come after all other steps have finished.

[Diagram: Synchronous communication]

This approach ensures consistency, but it can also create a performance bottleneck if not managed properly. It’s also important to note that synchronous communication creates tight coupling between all involved services. This pattern is ideal for scenarios that require immediate feedback, such as simple, direct interactions.
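For example, here’s a minimal sketch of a synchronous request-response call between two services using Python’s requests library. The inventory-service URL, endpoint, and payload are hypothetical, purely for illustration.

```python
import requests

# Hypothetical URL of a downstream inventory service.
INVENTORY_URL = "http://inventory-service/api/stock"

def place_order(item_id: str, quantity: int) -> dict:
    """The order service blocks until the inventory service responds."""
    response = requests.get(
        INVENTORY_URL,
        params={"item_id": item_id},
        timeout=2,  # fail fast instead of waiting indefinitely
    )
    response.raise_for_status()
    stock = response.json()

    if stock["available"] < quantity:
        raise ValueError("Not enough stock to fulfil the order")

    # The caller only gets an answer after every synchronous step has finished.
    return {"item_id": item_id, "quantity": quantity, "status": "confirmed"}
```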

But many workflows involve complex interactions between multiple microservices, and those call for a more sophisticated communication pattern.

Asynchronous communication

Asynchronous communication patterns involve services interacting with each other without waiting for an immediate response. Common asynchronous communication patterns include message queues and streaming platforms.

  • Message queues, like RabbitMQ and Apache ActiveMQ, operate much like a message board. Instead of waiting in line to give a message, one microservice simply leaves a message in a queue. These messages are then processed by the receiving services when they are able.
  • Streaming platforms, like Apache Kafka, allow microservices to publish and subscribe to continuous flows of data, while at the same time providing a scalable and highly available service.

[Diagram: Asynchronous communication]

When multiple microservices use a queue to asynchronously communicate (like in the above diagram), each is free to leave a message without waiting for an answer. The result is non-linear communication that does not require each service to wait before executing.

This pattern decouples services, enhancing scalability and fault tolerance. Services can work independently of each other, mitigating potential bottlenecks. It does, however, introduce more complexity than synchronous communication.
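As an illustration of the “leave a message and move on” model, here’s a minimal sketch of a producer publishing to a RabbitMQ queue with the pika client. The broker address, queue name, and payload are assumptions for the example.

```python
import json
import pika

# Connect to a RabbitMQ broker (assumed to run on localhost for this sketch).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Declare a durable queue so messages survive a broker restart.
channel.queue_declare(queue="orders", durable=True)

# The producer drops the message on the queue and returns immediately;
# it does not wait for the consumer to process it.
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=json.dumps({"order_id": "1234", "status": "created"}),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)

connection.close()
```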

Event-driven architecture

Event-driven architectures extend asynchronous communication by focusing on events, which are significant state changes associated with a point in time. Services publish events, then other services consume (or subscribe to) these “event streams” as needed.

  • Event publishing - When specific actions occur, such as changes in data or user actions, services generate events.
  • Event consumption - To keep up with appropriate events, services subscribe to and process published events, allowing them to react to changes in real time.

When multiple microservices communicate through event-driven architecture (like in the diagram below), each service pulls data from and writes data to a central, shared message queue. This provides a flexible, scalable communication model with loose coupling.

[Diagram: Event-driven architecture]
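A minimal sketch of this publish-and-consume model using the kafka-python client is shown below; the broker address, the user-events topic, the consumer group, and the event shape are all illustrative assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publisher: emits an event whenever a significant state change occurs.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("user-events", {"type": "user_signed_up", "user_id": 42})
producer.flush()

# Consumer (typically a separate service): subscribes to the event stream
# and reacts to changes as they arrive. This loop runs until interrupted.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="email-service",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print("received event:", message.value)
```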

Once you’ve chosen the right communication pattern for your microservices, you need to settle on a communication protocol that fits the pattern.

Protocols and their roles

Protocols define the rules for data exchange and ensure interoperability between services. The most common ones used for inter-service communication are:

Synchronous communication protocols

  • HTTP/HTTPS - Commonly used with REST for straightforward web communication.
  • Protobuf (Protocol buffers) - Used with gRPC for efficient binary serialization.

Asynchronous communication protocols

  • AMQP (Advanced Message Queuing Protocol) - Often used with message queues for reliable message delivery.

Event-driven protocols

  • Pub/sub (publish/subscribe) - A simple messaging model in which services publish events to topics and subscribers consume them, commonly used for event-driven communication between microservices.

Handling failure scenarios in communication

Simply picking the right communication pattern isn’t enough to ensure robust inter-service communication. Without proper fault tolerance, communication between services becomes a weak point in the system, leading to cascading failures.

The following four strategies will help your engineering team build resilience into your inter-service communication.

[Diagram: Handling failure scenarios in communication]

Retries

Retrying is a simple strategy that automatically resends failed requests after a brief delay. This helps mitigate transient issues, such as temporary network glitches or brief service disruptions.

How it works

  • When a request fails, the service waits for a predefined interval before retrying the request.
  • The number of retry attempts and the delay between attempts can be configured based on the specific use case.
  • Exponential backoff can be used to gradually increase the delay between retries, reducing the load on the failing service.

When dialed in, retries smooth out temporary disruptions, so users aren’t aware of these small faults. By increasing the chance of success for every request, retries also free your team from constantly intervening by hand.
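Here’s a minimal retry-with-exponential-backoff sketch in Python; the endpoint being called, the attempt limit, and the base delay are illustrative assumptions rather than recommendations.

```python
import time
import requests

def call_with_retries(url: str, max_attempts: int = 4, base_delay: float = 0.5) -> requests.Response:
    """Retry a failed request with exponentially increasing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=2)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_attempts:
                raise  # give up after the configured number of attempts
            # Exponential backoff: 0.5s, 1s, 2s, ... reduces load on the failing service.
            time.sleep(base_delay * (2 ** (attempt - 1)))
```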

Circuit breakers

Circuit breakers monitor the health of services and temporarily degrade or disable communication with services that are experiencing failures. This prevents one service outage from causing a chain reaction of cascading failures throughout the system.

How it works

  • The circuit breaker has three states: closed, open, and half-open. When the service is healthy the breaker stays closed, so requests flow normally. When a failure is detected in a service, the breaker opens. This blocks requests, giving the service time to recover. After a configured recovery period, the breaker moves to half-open, letting a limited number of requests through to test whether the service has recovered.
  • If the service responds successfully during the half-open state, the circuit breaker closes, and normal traffic resumes. If failures continue, the circuit breaker reopens.

Excessive pressure can quickly overwhelm a service, leading to cascading faults. Circuit breakers prevent services from being overwhelmed, allowing them to recover by relieving load pressure. And when services do fail, circuit breakers increase the stability of the system by isolating those failures and rerouting traffic.
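The closed/open/half-open cycle can be sketched in a few lines of Python. The thresholds and timings below are arbitrary illustrations; in production you would more likely reach for an established resilience library than roll your own.

```python
import time

class CircuitBreaker:
    """Minimal illustration of the closed -> open -> half-open cycle."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: request blocked")
            # Recovery window elapsed: half-open, let a trial request through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()  # open (or re-open) the breaker
            raise
        # Success: close the breaker and resume normal traffic.
        self.failure_count = 0
        self.opened_at = None
        return result
```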

Timeout settings

Timeout settings define the maximum interval a service will wait for a response from another service before considering the request failed. With proper timeout configuration, you can avoid the prolonged delays caused by services waiting too long for a response, and the resource exhaustion that follows.

How it works

  • Each service call is assigned a timeout period.
  • If the response is not received within the timeout period, the request is aborted, and an error is returned.

Timeouts prevent services from waiting indefinitely for responses, allowing the system to keep running without open requests piling up. They also help you identify and handle slow services promptly.
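In most HTTP clients this is a single parameter on the call. A small sketch with Python’s requests library, where the service URL and limits are purely illustrative:

```python
import requests

try:
    # Wait at most 1s to connect and 3s for the response body.
    response = requests.get("http://pricing-service/api/quote", timeout=(1, 3))
except requests.Timeout:
    # Abort the call and return an error (or a fallback) instead of waiting indefinitely.
    response = None
```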

Bulkheads

Bulkheads partition a system into isolated sections to prevent failures in one part from affecting the larger system. This is similar to compartmentalization in ship design, where individual sections can be sealed off to contain damage.

How it works

  • Resources (such as threads, memory, or database connections) are divided into separate pools, each of which is dedicated to specific services or functions.
  • This allows the majority of the system to continue to operate normally even if one pool becomes exhausted due to a failure.

By partitioning your system into well-designed bulkheads, you limit the impact of failures, ensuring the rest of your critical services remain available even when others are experiencing failures.
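One common way to implement bulkheads in application code is to give each downstream dependency its own bounded worker pool, so exhausting one pool cannot starve the others. A minimal sketch, with hypothetical service calls and arbitrary pool sizes:

```python
from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools per downstream dependency act as bulkheads:
# if the payments pool is saturated by a slow dependency, the search pool
# (and the rest of the system) keeps operating normally.
payments_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="payments")
search_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="search")

def charge_customer(order_id: str):
    ...  # call the (hypothetical) payments service

def search_catalog(query: str):
    ...  # call the (hypothetical) search service

payment_future = payments_pool.submit(charge_customer, "order-1234")
search_future = search_pool.submit(search_catalog, "wireless headphones")
```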

With the right communication patterns and robust strategies for handling failure scenarios, even the largest apps, like Spotify, can build in resilience.

How Spotify built resilience with an event-driven architecture

When Spotify transitioned to a microservices-based system, they adopted an event-driven architecture using Apache Kafka for inter-service communication. This gave Spotify the ability to build loose coupling, scalability, and fault tolerance into their microservices ecosystem.

Asynchronous, event-driven architecture

Spotify chose Kafka so their services could both publish and consume events asynchronously. This decoupled the services from each other, allowing each to evolve independently so Spotify could scale services as needed. Kafka's fault-tolerant and scalable design ensured that events were reliably delivered and processed, even in the face of failures or high loads.

To simplify integration, Spotify developed their own tooling and frameworks. They established guidelines and best practices for event design, schema evolution, and error handling to ensure consistent and reliable communication across services. This allowed engineering teams to build based on business logic rather than low-level communication details.

Building in resilience

Additionally, Spotify implemented advanced patterns like Event Sourcing and CQRS (Command Query Responsibility Segregation) to enhance the resilience and scalability of their system. Event sourcing allowed them to capture all state changes as a sequence of events, providing an audit trail and enabling event replay, while CQRS separated read and write operations, optimizing for different access patterns and scalability requirements.

Scaling and evolving a popular app

As we’ve seen, Spotify successfully transitioned to a microservices-based system that could scale and evolve independently, while maintaining loose coupling and fault tolerance. This approach allowed them to handle their app’s massive scale and complexity – a topic we’re going to touch on a little later in our 10-part series on migrating from a monolith to microservices.

Looking ahead

Next week we’ll publish the next post in the series: “Service discovery and load balancing in a microservices architecture”.

Or, you can download the complete 10-part series in one e-book now: "Monolith to microservices migration: 10 critical challenges to consider"

Book a free Policy Workshop to discuss your requirements and get your first policy written by the Cerbos team