The value of monitoring and observability in microservices, and associated challenges

Published by Emre Baran on December 02, 2024

Transitioning from a monolithic architecture to microservices is an intricate, time-consuming task. It demands both strategic foresight and meticulous execution.

In this 10-part series, we’ll guide you through the most common challenges faced during monolith-to-microservices migration. Last week we published the fourth part of our series, on how to design service discovery and load balancing in a microservices architecture. This week, we are going to dive into monitoring and observability in microservices.

We'll publish a new article every Monday, so stay tuned. You can also download the full 10-part series to guide your monolith-to-microservices migration.

A brief introduction to effective monitoring and observability

In a microservices environment, multiple services run concurrently, creating a complex network of processes. That makes it difficult to get a clear view of the overall health, performance, and behaviour of the application. And it often renders obsolete many traditional tools used in monolithic architectures.

That’s why effective monitoring and observability tools are critical for understanding what is happening at each layer of your application.

  • Observability tools provide insights into the internal state of the system. This makes it easier to understand and debug the complex interactions between services.
  • Monitoring tools collect and analyze metrics, logs, and traces from various services. This gives you critical data you can use to identify potential issues, bottlenecks, and anomalies.

Challenges of implementing monitoring and observability in microservices architectures

There are three major challenges companies have to overcome before achieving effective monitoring and observability.

  1. Interaction of data silos. Treating each microservice separately when implementing monitoring and observability solutions creates “data silos”. Each silo is easy to understand in isolation, but it reveals little about how the services interact as one system. This makes it difficult to debug problems or trace them to their root cause.
  2. Scalability. As your microservices architecture scales, the complexity of monitoring and observability grows with it. So monitoring everything with the same tools you were using for a single monolith quickly becomes unmanageable.
  3. Lack of standard tools. One of the benefits of microservices is that different teams can choose the data storage system that makes the most sense for their microservice (as we covered in blog 2 of the series, "Data management and consistency"). But if you don’t have a standard for monitoring and observability, tying siloed insights together to understand the system as a whole is challenging.

The foundation of monitoring and observability

To achieve effective monitoring and observability in a microservices environment, start with the three pillars of observability: metrics, logging, and tracing. Once the basics are established, you can move into enrichment, correlation, arbitrary querying, and more.

[Figure: The foundation of monitoring and observability]

Metrics

Metrics provide quantitative measurements of various aspects of the system, such as response times, error rates, resource utilization, and throughput. By collecting and analyzing these metrics, you can assess the performance and health of individual services and the system as a whole.

Prometheus, Grafana, InfluxDB, and Datadog are all popular tools for collecting and visualizing metrics. They help define and collect metrics from services, set up alerts and thresholds, and create dashboards for real-time monitoring.
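
To make the idea of a labelled metric concrete, here is a minimal sketch of a request counter written with only the Python standard library. It renders its values in the Prometheus text exposition format, which a Prometheus server scrapes over HTTP. The `Counter` class, the metric name `http_requests_total`, and the `orders` service are assumptions made for illustration; a real service would normally use an official client library rather than a hand-rolled class.

```python
from collections import defaultdict


class Counter:
    """A monotonically increasing counter, keyed by label values."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Render the counter in the Prometheus text exposition format.

        Label values are written as-is; this sketch does not escape them.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            labels = ",".join(
                f'{name}="{val}"' for name, val in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical metric and service names, for illustration only.
requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    ["service", "status"],
)
requests_total.inc(service="orders", status="200")
requests_total.inc(service="orders", status="200")
requests_total.inc(service="orders", status="500")

print(requests_total.render())
```

Splitting the count by `status` is what makes an error rate computable later: the ratio of `500` increments to all increments can be charted or alerted on without changing the service.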

Logging

Capturing and centralizing log messages generated by services during their execution provides valuable information about the behaviour of services, including error messages, debug information, and important events. This data can be saved to keep a record of performance or used as input for analytics tooling.

Centralized logging solutions like the ELK Stack (Elasticsearch, Logstash, and Kibana) or Fluentd help aggregate and analyze logs across microservice architectures. They enable you to search, filter, and visualize log data, making it easier to troubleshoot issues and understand the flow of requests through the system.
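
Centralized logging works best when every service emits logs in the same machine-readable shape. The sketch below, which assumes one JSON object per line is an acceptable format for your log shipper, uses Python's standard `logging` module with a custom formatter. The `JsonFormatter` class, the `payments` service name, and the `request_id` field are hypothetical names chosen for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # request_id is a hypothetical field passed through `extra=`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(service, stream):
    """Return a logger that writes JSON lines for `service` to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


# An in-memory stream stands in for stdout or a log shipper's input.
stream = io.StringIO()
log = build_logger("payments", stream)
log.info("charge accepted", extra={"request_id": "req-123"})
log.error("card declined", extra={"request_id": "req-124"})

# Each line parses back into a dict that can be filtered by any field.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
print(records[1]["level"], records[1]["request_id"])
```

Because every record carries the same keys, an aggregator can filter by `service` or `request_id` without per-service parsing rules.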

Tracing

Tracing involves capturing the end-to-end flow of requests as they traverse across multiple services. Tracing gives your team a better understanding of interactions between services so they can identify performance bottlenecks and pinpoint the root cause of issues.

Distributed tracing tools like Jaeger, Zipkin, and OpenTelemetry allow you to capture the timing and metadata of requests as they flow through the system. This provides a detailed view of the request lifecycle.
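
Tracing depends on each service passing a shared trace identifier to the next one. The following sketch shows that propagation step in isolation, using a header laid out like the W3C Trace Context `traceparent` header (version, trace ID, span ID, flags). The `SpanContext` class and the `start_trace` and `continue_trace` helpers are hypothetical; in practice a tracing SDK handles this for you.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex characters, shared by every span in a request
    span_id: str   # 16 hex characters, unique to one unit of work

    def to_traceparent(self):
        # Version "00", then trace ID, span ID, and flags "01" (sampled).
        return f"00-{self.trace_id}-{self.span_id}-01"


def start_trace():
    """Create the root span context at the edge of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def continue_trace(traceparent):
    """Start a child span in the same trace as the incoming header.

    Returns the new context and the ID of the parent span.
    """
    version, trace_id, parent_span_id, flags = traceparent.split("-")
    if len(trace_id) != 32 or len(parent_span_id) != 16:
        raise ValueError("malformed traceparent header")
    return SpanContext(trace_id, secrets.token_hex(8)), parent_span_id


# A hypothetical gateway calls a hypothetical orders service.
gateway = start_trace()
headers = {"traceparent": gateway.to_traceparent()}
orders, parent = continue_trace(headers["traceparent"])

print(orders.trace_id == gateway.trace_id, parent == gateway.span_id)
```

The trace ID stays constant across both services while each span gets its own ID, which is what lets a tracing backend reassemble the request as a tree.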

Metrics, logging, and tracing tools provide comprehensive visibility into your microservices system. When you implement these tools you are better able to monitor the health and performance of services, detect anomalies, and troubleshoot issues efficiently. Now let’s talk about some examples.

Uber maintains observability through growth

Uber experienced significant growth in their microservices architecture. This led to challenges in maintaining clear monitoring and observability. So, Uber focussed on building up the three pillars of monitoring and observability by implementing a robust stack of various open-source tools. This gave them a better understanding of what was happening across their app.

Metrics

For metrics, Uber used Prometheus to collect and store data and Grafana to visualize the results. Prometheus, a time-series database and monitoring system, was chosen for its flexible query language. This flexibility allowed Uber to define custom metrics relevant to their business. Grafana was used to create interactive dashboards and alerts based on the collected metrics.

Logging

Uber used Apache Kafka and Elasticsearch to build a centralized logging infrastructure that could handle their growth. Services published their logs to Kafka topics, which acted as a buffering layer. The logs were then consumed by a log aggregation pipeline that processed, transformed, and stored them in Elasticsearch. To make the data easier to search and more digestible, Uber used Kibana to visualize the log data.
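
The shape of such a pipeline (publish, buffer, transform, store) can be sketched without any infrastructure. The example below is not Uber's implementation: an in-process `queue.Queue` stands in for a Kafka topic, a plain list stands in for an Elasticsearch index, and the `publish`, `transform`, and `drain` functions are hypothetical names used only to show the flow.

```python
import json
import queue

log_topic = queue.Queue()  # stand-in for a Kafka topic (the buffering layer)
search_index = []          # stand-in for an Elasticsearch index


def publish(service, level, message):
    """A service serializes a log event and hands it to the buffer."""
    event = {"service": service, "level": level, "message": message}
    log_topic.put(json.dumps(event))


def transform(raw):
    """Normalize one raw event before it is stored."""
    event = json.loads(raw)
    event["level"] = event["level"].upper()
    event["is_error"] = event["level"] in {"ERROR", "CRITICAL"}
    return event


def drain(topic, index):
    """Consume everything currently buffered; return the number processed."""
    processed = 0
    while True:
        try:
            raw = topic.get_nowait()
        except queue.Empty:
            return processed
        index.append(transform(raw))
        processed += 1


publish("orders", "error", "payment gateway timeout")
publish("orders", "info", "order created")
processed = drain(log_topic, search_index)

print(processed, [event["is_error"] for event in search_index])
```

The buffer decouples producers from the consumer: services can keep publishing during a burst, and the consumer catches up at its own pace.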

Tracing

For distributed tracing, Uber initially used Zipkin but later transitioned to Jaeger, an open-source tracing system. Jaeger allowed Uber to instrument their services to generate trace data and provided a web UI for visualizing and analyzing traces. It helped them understand the flow of requests through their microservices architecture, identify performance bottlenecks, and debug issues.

Standardizing metrics, logging, and tracing with custom tools

With a variety of tools recording how the architecture behaved at multiple levels, Uber had to bring this information together to get observability into the architecture as a whole. They decided to develop custom tools and dashboards to standardize metrics, logs, and traces across the system. This gave their team a holistic view of their system's health and performance. They could correlate data from different sources and gain insights into the behaviour of their microservices.
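
Correlation of this kind usually relies on a shared key that appears in every signal. As a rough sketch, assuming both spans and log entries carry a `trace_id` field, the records can be grouped per request and then queried together. The sample data and the `correlate` and `traces_with_errors` functions are hypothetical and do not describe Uber's tooling.

```python
from collections import defaultdict

# Hypothetical records, shaped as they might look after export.
spans = [
    {"trace_id": "t1", "service": "gateway", "duration_ms": 120},
    {"trace_id": "t1", "service": "orders", "duration_ms": 95},
    {"trace_id": "t2", "service": "gateway", "duration_ms": 15},
]
logs = [
    {"trace_id": "t1", "service": "orders", "level": "ERROR",
     "message": "inventory lookup timed out"},
    {"trace_id": "t2", "service": "gateway", "level": "INFO",
     "message": "ok"},
]


def correlate(spans, logs):
    """Group spans and log entries by the trace they belong to."""
    by_trace = defaultdict(lambda: {"spans": [], "logs": []})
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    for entry in logs:
        by_trace[entry["trace_id"]]["logs"].append(entry)
    return dict(by_trace)


def traces_with_errors(correlated):
    """Return the IDs of traces that logged at least one error."""
    return sorted(
        trace_id
        for trace_id, data in correlated.items()
        if any(entry["level"] == "ERROR" for entry in data["logs"])
    )


correlated = correlate(spans, logs)
print(traces_with_errors(correlated))  # prints ['t1']
```

With the signals joined, a slow span and the error log that explains it sit side by side instead of in two separate tools.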

[Figure: Metrics, logging, and tracing]

Building up by focusing on the foundation

Uber regained deep visibility into their microservices architecture by revisiting the three pillars of observability and monitoring. With this focus on reworking the foundation of visibility, they are now able to proactively identify and resolve issues, optimize performance, and ensure a reliable and seamless experience for their users.

Looking ahead

Ready for the next in the series? Continue to “Testing and deployment strategies in the microservices architecture”.

Or, you can download the complete 10-part series in one e-book now: "Monolith to microservices migration: 10 critical challenges to consider".
