AWS Observability Best Practices for Your Application

Achieving application observability, the ability to gain insight into an app's inner workings, is a widespread business challenge. While observability is increasingly important, the path to it is often obstructed by a common stumbling block: the intricacies of effective instrumentation.

This article walks you through Amazon's best practices for overcoming this challenge, so that organizations can gather and analyze their app's data, gain insights, and take appropriate action to tackle issues and keep performance optimal. After reading, you will be able to apply the following knowledge to your app's benefit:

  • why instrumentation matters for observability
  • how to handle high cardinality without sacrificing visibility
  • what tools and services AWS offers for instrumentation
  • what best practices Amazon promotes for achieving visibility

Although businesses across industries and locations differ in many respects, they share the same priority: delivering value through a secure, high-performing app despite inherent technological constraints. Application observability is, at its core, a quest to understand how a service is functioning, fine-tuned to customers' requirements and stakeholders' expectations, and it springs from a profound understanding of how a given app operates.

Application observability best practices

The concept of observability is gaining momentum due to its broad applicability, especially to more complex systems such as microservices and distributed architectures. When approaching the complexities of app observability, you can rely on the established best practices that we discuss throughout this article; they enable businesses to analyze their app's data and use it to resolve and prevent issues and enhance performance.

We now invite you to join us in the exploration of the instrumentation best practices whose incorporation into your observability strategy is bound to make a tangible difference to your understanding of an application and its maintenance.

Navigating the instrumentation options

Instrumentation in application observability is the process of adding code, tools, or agents to your software applications and systems in order to collect information and generate insights about their behavior. By strategically placing monitoring points within your application's code or infrastructure, instrumentation captures critical data such as metrics, logs, and traces.

AWS Observability tools

When it comes to instrumentation, you are free to choose from a variety of options, but it is worth remembering that instrumentation is a collaborative effort: optimal results depend on contributions from both sides, the cloud provider and the customer.

Why instrument with Lambda, a serverless service

We will now illustrate some of the services that Amazon suggests for instrumentation, starting with a serverless computing service, AWS Lambda, as our first example. In Amazon's documentation for every service, you will find essential information, including:

  • Available Statistics
    AWS clarifies which statistics are available for each metric, such as sums, percentiles, or averages. Placing the focus on the right statistic is your key to effective interpretation.
  • Metric Explanation
    Amazon provides simplified descriptions of what each metric measures, so that its significance is clear even to non-technical readers.
  • Dimensions
    Many metrics span multiple dimensions, which are important to grasp. For instance, in the AWS synthetic testing service, a canary has a duration metric. If the canary contains test steps, each step has its own duration metric.

Speaking of data, various services can send logs to CloudWatch or even S3; this is also true for Lambda, which sends metrics and logs automatically and whose initial setup is hassle-free. Tracing is just as simple in Lambda: enabling X-Ray tracing, for example, lets you capture a trace every time your Lambda function runs, and the experience keeps becoming more seamless.

Instrumenting Lambda functions is quite user-friendly. Think of it as taking notes: you use a console logging tool to record data such as important information or warnings. These logs, along with measurements of how well your function is doing, can be sent to Amazon CloudWatch, where you can review them in a number of ways. For efficiency, however, we recommend the embedded metric format, which we cover later in this article.
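As a minimal sketch (assuming a Python runtime; the handler and field names are purely illustrative), logging structured JSON from a Lambda function is enough to land searchable records in the function's CloudWatch Logs group:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Anything written through the logger lands in the function's CloudWatch
    # Logs log group automatically -- no agent or extra setup required.
    logger.info(json.dumps({
        "message": "order processed",        # human-readable note
        "orderId": event.get("orderId"),     # hypothetical business field
        "durationMs": 42,                    # a measurement worth reviewing later
    }))
    logger.warning("inventory running low")  # plain-text warnings work too
    return {"statusCode": 200}
```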

Now let us discuss tracing from the perspective of Lambda functions. As mentioned earlier, enabling logging, and now tracing, is quite straightforward in Lambda: ticking a checkbox suffices for sharing traces with X-Ray. That being said, you will still need to do some instrumentation on your own (your share of the responsibility, as we agreed previously) if your Lambda function performs additional tasks such as putting data in an S3 bucket or calling another Lambda function. A handy heads-up: you can ensure that traces follow the path of your function by using the X-Ray SDK and wrapping your code accordingly.
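Here is a hedged sketch of that wrapping, assuming the Python aws-xray-sdk package and active tracing enabled on the function; the bucket, function, and subsegment names are illustrative:

```python
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch boto3 (and other supported libraries) so every AWS call made from this
# function is recorded as a subsegment of the Lambda trace.
patch_all()

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

def handler(event, context):
    # A custom subsegment around your own business logic.
    with xray_recorder.in_subsegment("process-order"):
        s3.put_object(Bucket="example-bucket", Key="order.json", Body=b"{}")
        lambda_client.invoke(FunctionName="downstream-function", Payload=b"{}")
    return {"statusCode": 200}
```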

Instrumenting with containers

So far we have been discussing serverless computing services, with Lambda as our example. From now on we shift to containers. Containers offer a lot of possibilities, but a one-size-fits-all solution that covers all aspects of observability (logs, metrics, and traces) does not exist. Here is a breakdown of some options:

  • AWS Distro for OpenTelemetry: this option can collect metrics and traces, offering a solid foundation for observability.
  • The Cloud: while the cloud environment provides various tools, it does not deal with logs directly.
  • CloudWatch Agent: the tool effectively collects logs and metrics but does not perform trace collection.
  • Fluent Bit and FireLens: these tools are primarily focused on collecting logs; they do not handle metrics or traces.

As we can see, there is currently no single solution that seamlessly handles all three aspects of instrumentation. We are going to share some recommendations concerning the choice of a suitable instrumentation agent. For now, we suggest beginning with OpenTelemetry for metrics and traces and then selecting additional tools based on your log management needs.

OpenTelemetry for your metrics and traces

OpenTelemetry is an open standard for gathering observability data, such as metrics, logs, and traces. Currently, metrics and traces are generally available (GA), while logs are still in draft status within the specification.

Though collecting logs in OpenTelemetry is not yet generally available, the tool aims to streamline the collection of these signals by providing a unified approach. This involves instrumenting your application, whether manually or automatically, to gather system logs, infrastructure metrics, application logs, tracing data, and application metrics.

An extremely useful feature of the tool is the ability to channel all three signal types (metrics, logs, and traces) into an OpenTelemetry collector, such as AWS Distro for OpenTelemetry (ADOT). The collector receives these signals in a standardized format, optionally enriches them, and then lets you export the data to various destinations, including X-Ray, Jaeger, or Zipkin.
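A minimal sketch of this pipeline on the application side, assuming the opentelemetry-sdk and OTLP exporter packages and a collector (for example an ADOT sidecar) listening on the default OTLP gRPC port; the service and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service once; the collector and back ends attach this metadata
# to every span the application emits.
resource = Resource.create({"service.name": "product-info-service"})

provider = TracerProvider(resource=resource)
# Ship spans to the local OTLP endpoint; the collector decides whether they
# end up in X-Ray, Jaeger, Zipkin, or somewhere else entirely.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("get-product") as span:
    span.set_attribute("product.id", "123")  # illustrative attribute
```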

One of the outstanding features of OpenTelemetry is its flexibility. With it, you are never locked into a proprietary system for data collection; quite the opposite, you can collect the data once and distribute it to multiple destinations. On the one hand, this centralized approach allows data to be correlated. On the other, you can shape your observability setup according to what suits your workload best: OpenTelemetry ships through different collectors, such as the already mentioned AWS Distro for OpenTelemetry, and sits alongside different logging agents, such as FireLens, Fluent Bit, or the CloudWatch agent.

Capturing the important telemetry

Amazon takes a rigorous approach to collecting telemetry for every application and service. It captures information for every unit of work or request, be it an HTTP request or a message from a queue. This telemetry is logged in a structured format known as a “request data log.”

Operating a large-scale e-commerce site requires thorough visibility into the performance and issues of your code. If the current implementation lacks crucial insights, you are left with many open questions when problems such as request failures or slow responses arise. Without instrumentation, the code does not reveal the request's purpose, its caching status, database interactions, the reasons behind timeouts, and so on. These gaps are bound to hinder effective operation at scale.

At Amazon, they address this challenge by using a common metrics library across the company. When a service receives a new request, it instantiates a metrics object. This object collects vital data about each request, including details about the request, cache usage, database interactions, and timeout distinctions. This approach enhances operational efficiency without significantly increasing the codebase, as it offers comprehensive insights for diagnosing and addressing issues when they occur.

As to the technical side of it, the metrics object is passed around systematically throughout the codebase. It is used with cache libraries and remote call SDKs, which are also designed to interface with this metrics system. This collaboration allows for automatic instrumentation of requests, gradually accumulating more instrumentation data as the request progresses; this is because the libraries and SDKs seamlessly integrate with the metrics API.
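Amazon's internal metrics library is not public, but a minimal sketch of the pattern it describes, with all names hypothetical, might look like this:

```python
import json
import time

class RequestMetrics:
    """Hypothetical per-request metrics object, created once per unit of work."""

    def __init__(self, operation: str):
        self._start = time.time()
        self.data = {"operation": operation}

    def add_fact(self, key: str, value):
        # Qualitative facts: cache hit or miss, which table was queried, etc.
        self.data[key] = value

    def add_time(self, key: str, millis: float):
        # Quantitative measurements: how long a dependency call took.
        self.data[key] = millis

    def flush(self):
        # Emit one structured "request data log" line when the request finishes.
        self.data["totalTimeMs"] = round((time.time() - self._start) * 1000)
        print(json.dumps(self.data))

# Usage inside a request handler (cache libraries and SDK wrappers would call
# add_fact/add_time themselves in the pattern described above):
metrics = RequestMetrics("GetProduct")
metrics.add_fact("cacheHit", False)
metrics.add_time("dynamoDbTimeMs", 12)
metrics.flush()
```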

On top of this, the code incorporates facts that describe what the request was handling and also serve as contextual information for debugging purposes.

All this instrumentation generates a rich recorded dataset, featuring essential details about the request, such as the requester’s identity, source, trace ID, and data about the infrastructure, including instance ID, Availability Zone, and the specific node involved. There is also a wealth of timing and measurement data, encompassing the duration of various actions and the quantity of processed items. Essentially, this instrumentation provides both quantitative measurements and qualitative facts about the request and its execution.

The embedded metric format configuration

In addition to measurements and facts, the data contains attributes, which are represented as strings. This is where the embedded metric format comes in: it allows specific properties in logs to be selected and converted into metrics. The metric definition specifies the namespace for grouping metrics and indicates that the “time” property should be transformed into a metric.

To illustrate this concept, let us consider using a tool like CloudWatch to visualize the metrics. In this example, we have an embedded metric definition in a log originating from the “product info service.” It defines the metric’s namespace, does not include dimensions, and utilizes the “time” attribute from the log as the source for the metric. You can configure units for displaying the metric, and the actual measurement contributes to the plotted line.
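In Lambda, such a log entry is simply a JSON line written to standard output. Here is a hedged sketch of roughly what it could look like (field names and values are illustrative; the `_aws` block is the part CloudWatch interprets as the metric definition):

```python
import json
import time

# One log line from the "product info service": the "_aws" section tells
# CloudWatch to turn the "Time" property into a metric in the given namespace,
# with no dimensions yet; everything else remains available as log context.
emf_log = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "ProductInfoService",
            "Dimensions": [],
            "Metrics": [{"Name": "Time", "Unit": "Milliseconds"}],
        }],
    },
    "Time": 137,             # the actual measurement contributing to the plotted line
    "requestId": "abc-123",  # log-only property, not turned into a metric
}
print(json.dumps(emf_log))
```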

However, when a spike in the metric occurs, indicating, say, a slowdown in the “product info service,” the metric does not provide insights into the underlying reasons for the performance issue: it merely signals that an issue has occurred.

In order to understand the root cause of such issues, it is essential to step back and examine the architecture of the “product info service.” The latter involves various operations on its API, such as “get product” using DynamoDB, “update product” using a queue, and “search product” using a search index. Each of these APIs has its own set of dependencies and potential issues, so it makes sense to separate them for better visibility.

This is where metric dimensions become valuable. The best-case scenario involves grouping metrics by both the “product info service” namespace and the specific operation, such as “get,” “search,” and “update.” Achieving this requires adding an operation attribute to each request in your log, indicating which API it corresponds to. In the embedded metric format configuration, you can specify the dimension as the operation attribute. This setup ensures that each unique string found in the operation field creates its distinct metric, allowing you to break down telemetry data by operation.
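Continuing the illustrative log line from above, promoting the operation attribute to a dimension only requires declaring it in the embedded metric definition:

```python
import json
import time

emf_log = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "ProductInfoService",
            # Each distinct Operation value ("GetProduct", "SearchProduct",
            # "UpdateProduct") now becomes its own time series.
            "Dimensions": [["Operation"]],
            "Metrics": [{"Name": "Time", "Unit": "Milliseconds"}],
        }],
    },
    "Operation": "GetProduct",  # which API handled this request
    "Time": 980,                # the slow request we want to attribute
}
print(json.dumps(emf_log))
```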

Using dimensions in this way allows for capturing detailed telemetry from your application, inclusive of custom business metrics, which can then be transformed into measurements for your dashboards and alarms, providing deeper visibility and understanding of your system’s behavior.

It is worth noting that various observability platforms have their own methods for capturing metrics at scale. In the context of the metrics library mentioned above, we recommend the open-source client libraries for the embedded metric format (aws-embedded-metrics) for this purpose.

Dashboards

At Amazon, dashboards play a crucial role in providing a quick, focused view of what is happening within a system from a specific perspective. For instance, stakeholders or product managers might require dashboards that concentrate on the user experience.


A common mistake we have observed, though, is the tendency to overload dashboards with excessive information: cluttering them with too much data can hinder their effectiveness precisely when they are needed most, during critical operational situations. Dashboards are most valuable during operational events, when various team members take on different roles to address and resolve issues. So, when designing dashboards, simplicity should be the guiding principle.

Each type of dashboard serves a unique purpose within an organization’s operational and monitoring framework, catering to the needs of different stakeholders and scenarios. Some of the types are described below.

Customer experience dashboards

Offering a high-level view of customer experience, such dashboards aid communication among business leaders, tech teams, and customers. They focus on key aspects of customer experience, highlighting the impact of the actions taken on the end user.

System-level dashboards

Dedicated to web-based services, system-level dashboards provide engineers with data on system performance, especially for customer-facing endpoints accessed via UIs or APIs.

Microservice dashboards

This type of dashboard not only enables a quick assessment of customer experience within individual services, helping engineers stay focused during operational events, but also tracks dependencies between microservices.

Capacity dashboards

Used for resource and service monitoring, capacity dashboards come in handy in long-term capacity planning, ensuring that teams have sufficient computing and storage resources.

These were just some of the available dashboard types; in fact, there are many more. As we have stated, visibility is not about cramming your dashboards with loads of different metrics, nor is it about employing every possible kind of dashboard.

What could really make a difference with your system observability, though, is a culture of continuous improvement. This basically comes down to regularly incorporating lessons learned from past events. One effective way to achieve this is by managing your dashboards programmatically using an infrastructure-as-code approach.

As an example, part of this continuous improvement culture at AWS is a routine of refining dashboards: services are selected at random for audit, which ensures that all teams are prepared to discuss their dashboards during operational reviews and promotes proactive readiness.

Optimizing the cost of high cardinality

High cardinality is a vital concept in observability, affecting both insights and costs. It refers to a large number of unique dimension combinations, which can lead to a significant increase in the number of potential metrics. Let us look further into the implications of high cardinality for observability.

Metrics and dimensions
When measuring a latency metric with dimensions such as Request ID, Customer ID, and Operation (e.g. PutItem or ListItems), those dimensions bring about high cardinality: the number of unique dimension combinations, and therefore of individual metrics, grows very quickly.

Traditional vs. modern architecture
High cardinality becomes more prominent as systems evolve. Traditional monoliths may have around a thousand metrics, whereas modern microservices can result in millions of potential metrics.

Costs of high cardinality
Managing vast numbers of metrics leads to significant costs: roughly a quarter of a million dollars per month for 10 million metrics. This applies broadly, not just to CloudWatch.

Balancing metrics and logs
While logs are cost-effective, metrics offer faster insights. Making informed choices between them is key. Still, applying both approaches will let you optimize costs and insights.

These were just some of the implications of the high cardinality that greater observability brings along. However, high cardinality can be managed effectively, and AWS offers solutions like CloudWatch's embedded metric format to do so while also controlling costs. By selecting the appropriate combination of metrics and logs, employing efficient formats, and following best practices, you can handle high cardinality efficiently and ensure effective observability across diverse workloads.

Best practices

  • Identify potential high cardinality dimensions
    High cardinality involves dimensions with numerous unique values, such as user IDs, request paths, or resource names, which are associated with increased storage and query costs. At the same time, identifying which attributes drive high cardinality helps make informed decisions on dimension management and analysis.
  • Ingest your telemetry as a log first
    A cost-effective approach to managing high cardinality data is ingesting telemetry as logs before converting them into metrics. This is backed up by the fact that logs offer greater flexibility for capturing detailed data, including high cardinality dimensions, without the expense associated with high cardinality metrics. What is more, storing logs for extended periods does not incur the same storage costs as metrics, which are billed based on unique dimensions. Starting with logs allows you to align your decisions on which dimensions to elevate to high cardinality metrics with your specific analysis requirements.
  • Then create metrics using appropriate dimensions
    Once you have ingested telemetry as logs and pinpointed the most relevant high cardinality dimensions, you can purposefully generate metrics with those dimensions (see the sketch after this list). This approach prevents creating metrics for every dimension immediately, which can result in unnecessary expenses. Crafting metrics only for dimensions that offer valuable insights and align with your observability objectives means prioritizing crucial dimensions, optimizing metric usage, and minimizing the cost impact of high cardinality while still collecting relevant information.
  • If you use CloudWatch, leverage the embedded metric format
    AWS CloudWatch provides an embedded metric format, allowing you to craft custom metrics with high cardinality dimensions directly from your log data. With the tool, you can define custom metrics using structured logs, which reduces the necessity to create high cardinality metrics in advance. This format lets you manage costs by selectively deciding which dimensions to elevate to metrics, avoiding unnecessary metric generation. In this way, you can have the best of both worlds: log flexibility with metric efficiency, and optimizing costs while ensuring observability.
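A hedged sketch of the log-first flow using boto3: telemetry is already being ingested as structured logs, and a CloudWatch Logs metric filter later promotes one selected field to a metric. The log group, field, and metric names are illustrative:

```python
import boto3

logs = boto3.client("logs")

# Promote just one field (latencyMs for the GetProduct operation) to a metric,
# instead of emitting metrics for every high cardinality attribute up front.
logs.put_metric_filter(
    logGroupName="/ecommerce/product-info-service",
    filterName="get-product-latency",
    filterPattern='{ $.operation = "GetProduct" }',
    metricTransformations=[{
        "metricName": "GetProductLatency",
        "metricNamespace": "ProductInfoService",
        "metricValue": "$.latencyMs",  # pull the value from the log field
    }],
)
```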

Reducing alarm fatigue

Your alerting strategy matters a great deal, as alarms are the main means of communication between your business and its technical side. Alerts exist not merely to be triggered but to lead to actionable steps, so it pays to be deliberate about them: they can significantly impact your business outcomes. A detrimental scenario is alert fatigue and excessive alarm noise: having too many alerts makes it easy to overlook critical matters.

To ensure clarity, it is better to define the actions expected to follow an alert with the help of playbooks: they guide anyone responding to an alert, regardless of their experience or familiarity with the business, through the established steps. Whenever possible, it is also a good idea to remediate issues through runbooks, for automated or semi-automated resolution.

Best practices

  • Alarm on key performance indicators and workload metrics
    Identify the key performance indicators (KPIs) that align with your application's objectives and user expectations; at the same time, determine the critical metrics for your service or application (e.g. a high-volume e-commerce website may prioritize tracking order processing rates, page load speed, or search latency). Having done so, you will not lose sight of your strategic goals when guided by KPIs and workload metrics. An effective and straightforward way to enhance your alarm system is to implement synthetic testing, and AWS offers a solution called CloudWatch Synthetics for this very need.
  • Alert when your workload outcomes are at risk.
    Configured to be triggered by critical metrics only, alarms will activate when your workload’s performance is in jeopardy. This strategy prevents alert fatigue by minimizing unwarranted alerts and flagging only substantial deviations from desired results. CloudWatch’s synthetic canaries are an effective method for achieving this since they simulate user interactions with your system.
  • Create alarms on dynamic thresholds instead of static ones.
    For app workloads that are dynamic, with fluctuating activity levels, consider employing dynamic thresholds that adapt to the workload’s typical behavior instead of setting static thresholds. Relevant thresholds that accommodate fluctuations and automatically adjust to workload patterns can be set upon analyzing historical data. This way, you reduce false alarms and the volume of unnecessary notifications, limiting those to exclusively pertinent ones.
  • Correlate your alarms and notify a person when that correlation happens
    While you can have numerous alarms, it is essential to ensure that the individual alarms grouped under a composite alarm do not send notifications of their own. Composite alarms in CloudWatch, and similar features in other observability platforms, allow you to use operators like “and,” “or,” and “not” to merge multiple metric alarms. These composite alarms notify you when something unexpected occurs and let you correlate alarms effectively without producing excessive notifications (see the sketch after this list).
  • Leverage machine learning algorithms to detect anomalies
    Machine learning can be used to develop advanced anomaly detection models that understand your application’s metric patterns and detect genuine deviations from the expected behavior. These models adapt and comprehend intricate metric relationships, resulting in more precise and timely alerts. This approach helps mitigate alarm fatigue and enhances alert accuracy.
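The following is a hedged boto3 sketch combining two of the practices above: an anomaly-detection alarm whose threshold is a band learned from the metric's history, and a composite alarm that only notifies when that alarm and a separately created error-rate alarm fire together. All names, ARNs, and metric choices are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# 1) Dynamic threshold: alarm on deviations from an anomaly-detection band
#    rather than a static number.
cloudwatch.put_metric_alarm(
    AlarmName="order-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "ProductInfoService",
                    "MetricName": "Time",
                    "Dimensions": [{"Name": "Operation", "Value": "GetProduct"}],
                },
                "Period": 60,
                "Stat": "p99",
            },
        },
        {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
    ],
)

# 2) Correlation: notify a person only when latency and errors are bad at once.
cloudwatch.put_composite_alarm(
    AlarmName="order-flow-at-risk",
    AlarmRule='ALARM("order-latency-anomaly") AND ALARM("order-error-rate")',
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:on-call"],
)
```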

Avoiding dangling traces

Though the concept behind tracing is often difficult to grasp, we still need a general working understanding of it to get a foot in the observability door.

Basically, tracing in systems like X-Ray is facilitated through a tracing header, which consists of three components. First, there is the root trace ID, which serves as the identifier that links everything together. Second is the optional parent component, typically empty if no upstream activity is involved. Lastly, there is a sampling decision. These constitute the basis of X-Ray traces.

As to the functioning of tracing, consider a common scenario involving API Gateway, Lambda functions, and S3 buckets. The ideal trace captures all three of these steps, each represented as circles. In the context of OpenTelemetry, these steps are referred to as “Spans,” while X-Ray uses the term “Segments,” though they both can be used to describe similar concepts within tracing.

The following example should make tracing easier to grasp. It involves a span and a trace ID. Suppose our illustration involves trace ID 123. The initial span, initiated by API Gateway, might have a span ID of ABC. When API Gateway communicates with Lambda, it automatically passes the trace context, including trace ID 123 and upstream parent ABC, to Lambda. Lambda then generates its own span ID (or segment ID), such as DEF, and passes this context further downstream. The trace ID remains 123 throughout, but the parent changes as the request moves through different components.
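In X-Ray this context travels in the X-Amzn-Trace-Id tracing header. Here is a small sketch of what the propagated values might look like, with placeholder IDs echoing the 123/ABC/DEF walkthrough (real root IDs are hex strings of the form 1-timestamp-random):

```python
# The Root stays the same across the whole request, while the Parent changes
# at every hop; Sampled records the sampling decision.
hop_api_gateway_to_lambda = "Root=1-00000123-000000000000000000000123;Parent=ABC;Sampled=1"
hop_lambda_downstream     = "Root=1-00000123-000000000000000000000123;Parent=DEF;Sampled=1"

def parse_trace_header(header: str) -> dict:
    """Split an X-Ray tracing header into its Root / Parent / Sampled parts."""
    return dict(part.split("=", 1) for part in header.split(";"))

print(parse_trace_header(hop_lambda_downstream))
# {'Root': '1-00000123-000000000000000000000123', 'Parent': 'DEF', 'Sampled': '1'}
```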

Let us go further and imagine a scenario with an ECS Fargate container pushing data to Amazon MQ (RabbitMQ), where there is no OpenTelemetry tracing in RabbitMQ. Consequently, no tracing context is sent to RabbitMQ, and the second ECS Fargate container that pulls data from the queue has no context to work with. This results in a dangling trace, where you see disconnected segments without a trace connection between them. This case, however, can and should be treated. Typical ways of resolving a dangling trace problem are presented in the following subsection.

Best practices

  • Instrument all your code
    If you wish to have visibility into every single step of your app's functioning, you had better instrument all of the code. You can capture and trace each operation, including functions, methods, and interactions with AWS services, by inserting code into the application that generates traces and transmits them to the tracing system. Instrumenting all the way through generates enough data to track the journey of requests and responses, which facilitates the detection of bottlenecks, failures, or delays within your application and is the first step towards tackling those issues.
  • Understand which AWS services support tracing and how they do so
    Knowing which AWS services inherently support tracing can prove a great advantage, and understanding how those services interact with your chosen tracing system even more so. You win a lot when you are ready to use the services' native features, such as automatic trace creation and seamless context transfer, which reduce the risk of dangling traces in scenarios involving interactions between different services and tracing systems.
  • In case a service does not support tracing:
    - Pass the trace context across that service boundary. If a service does not include inherent tracing support, you need to manually transmit the trace context across its boundaries; otherwise you interrupt trace continuity and lose essential data. Simply incorporating the trace ID and pertinent details within the request or message guarantees that the trace context is preserved.
    - Resume the context on the other side of the service boundary. On the receiving end, extract the trace context from the incoming request or message and resume the trace. Filling this gap in the trace information still allows you to track the path of a particular request within your app, even if segments of that journey involve services that do not provide native tracing (see the sketch after this list).
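A hedged sketch of both steps using OpenTelemetry's propagation API (assuming the opentelemetry-api and opentelemetry-sdk packages; an in-memory list stands in for the queue, and the same idea applies with the X-Ray SDK):

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())  # minimal setup; use your real provider/exporter
tracer = trace.get_tracer("order-pipeline")

# Producer side: the broker is not trace-aware, so we carry the context in
# message headers ourselves (inject writes the W3C traceparent/tracestate keys).
def publish(queue: list, body: str):
    with tracer.start_as_current_span("publish-order"):
        headers: dict = {}
        inject(headers)
        queue.append({"headers": headers, "body": body})

# Consumer side: pull the context back out and resume the trace, so this span
# is stitched to the producer's instead of dangling on its own.
def consume(queue: list):
    message = queue.pop(0)
    parent_ctx = extract(message["headers"])
    with tracer.start_as_current_span("process-order", context=parent_ctx):
        print("processing", message["body"])

q: list = []
publish(q, "order-123")
consume(q)
```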
Serhiy Kozlov, CEO at Romexsoft | AWS Certified Solutions Architect | LinkedIn Profile