Adopt time series pattern analysis – Enabling the Observability of Your Workloads

Mike Naughton | July 2nd, 2023


For some metrics, the pattern of the time series is more informative than any individual scalar value. Watching the pattern supports proactive analysis: for example, your operations team might notice that the API error count spikes every 10 minutes but never crosses the threshold your alarms are set at.
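To make the idea concrete, here is a minimal Python sketch of how a recurring spike can be found even though no alarm fires. The sample data, the alarm threshold of 50, and the helper names `find_spikes` and `spike_period` are all assumptions made for illustration; a real workload would read the series from its monitoring system and would likely need a more robust detector.

```python
from statistics import mean, pstdev


def find_spikes(series, z=2.0):
    """Return indices whose value exceeds mean + z * standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []  # a flat series has no spikes
    return [i for i, value in enumerate(series) if value > mu + z * sigma]


def spike_period(spike_indices):
    """Return the common gap between consecutive spikes, or None if gaps differ."""
    gaps = {b - a for a, b in zip(spike_indices, spike_indices[1:])}
    return gaps.pop() if len(gaps) == 1 else None


ALARM_THRESHOLD = 50  # hypothetical static alarm threshold (errors per minute)

# Hypothetical data: one sample per minute for an hour, with a baseline of
# 2 errors and a burst of 30 errors every 10th minute.
errors_per_minute = [30 if minute % 10 == 0 else 2 for minute in range(60)]

spikes = find_spikes(errors_per_minute)
period = spike_period(spikes)
alarm_fired = any(value > ALARM_THRESHOLD for value in errors_per_minute)

print(f"spikes at minutes {spikes}, period={period}, alarm fired={alarm_fired}")
```

The static threshold never trips, yet the regular 10-minute gap between spikes is exactly the kind of signal that points to a scheduled job or a retrying client.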

You can also plot related time series on the same graph to see how certain parameters have evolved relative to one another and whether there is a correlation that warrants a deeper look.
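Alongside the visual comparison, a correlation coefficient gives a quick numeric check. The sketch below computes the Pearson coefficient for two series in plain Python. The metric names and sample values are invented for illustration, and a strong correlation is only a hint about where to look, not proof of cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical samples taken at the same timestamps.
latency_ms = [120, 125, 180, 240, 310, 300]
queue_depth = [4, 5, 11, 18, 26, 25]

r = pearson(latency_ms, queue_depth)
print(f"latency vs. queue depth: r = {r:.3f}")
```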

Annotate graphs with good and bad thresholds

Different applications can have different thresholds for what good or bad looks like. It might be OK for a compute-intensive batch workload to report CPU utilization consistently at 80%, whereas the same figure might fall outside the expected utilization band for another type of application. Furthermore, teams keep evolving: new members are onboarded and others are offboarded, so knowledge of what counts as normal should not live only in people's heads. When such threshold ranges are embedded and plotted together with the underlying metrics, anyone can readily identify outliers and what is not OK.
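As a rough sketch of what this can look like in CloudWatch, the Python snippet below builds a dashboard widget definition with horizontal annotations: a shaded "expected" band and a "bad" line with the area above it filled. The instance ID, region, band values, and the helper name `cpu_widget` are placeholders chosen for illustration. The resulting JSON string is the kind of value you would supply as a dashboard body, for example through the boto3 `put_dashboard` call, which is not invoked here.

```python
import json


def cpu_widget(instance_id, expected_low, expected_high, bad_above, region="us-east-1"):
    """Build a CloudWatch metric widget annotated with good and bad thresholds."""
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": f"CPU utilization ({instance_id})",
            "view": "timeSeries",
            "region": region,
            "stat": "Average",
            "period": 300,
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "yAxis": {"left": {"min": 0, "max": 100}},
            "annotations": {
                "horizontal": [
                    # A pair of values is drawn as a shaded band.
                    [
                        {"label": "Expected band", "value": expected_low, "color": "#2ca02c"},
                        {"value": expected_high},
                    ],
                    # A single value with a fill marks everything above it.
                    {"label": "Bad", "value": bad_above, "fill": "above", "color": "#d62728"},
                ]
            },
        },
    }


# Placeholder values for a compute-intensive batch workload.
widget = cpu_widget("i-0123456789abcdef0", expected_low=60, expected_high=85, bad_above=95)
dashboard_body = json.dumps({"widgets": [widget]}, indent=2)
print(dashboard_body)
```

Because the thresholds live in the dashboard definition itself, they can be versioned with the rest of your infrastructure code and reviewed like any other change.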

Plot the right dimensions

CloudWatch metrics are built around the concept of dimensions. A plain metric is just data, but when we attach related dimensions to it, it becomes information. How you use these dimensions plays an important role in bringing the most relevant information to the surface on your operational dashboards.

As an example, one view could be the total number of API errors occurring in a defined time window. But a high total could simply mean that a single customer is consistently hitting an API endpoint with the wrong payload. This is not something you want to be paged on, so a better dimension to look at could be API errors occurring per customer. This can help with ruling out false positives, thereby saving precious investigation time.
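The difference between the two views can be sketched in a few lines of Python. The request events, the metric name `ApiErrors`, and the dimension name `CustomerId` below are assumptions for illustration. The generated entries follow the shape of the `MetricData` parameter of the CloudWatch `put_metric_data` API, but nothing is actually sent.

```python
from collections import Counter


def errors_per_customer(events):
    """Count error responses (HTTP status >= 400) per customer."""
    return Counter(e["customer_id"] for e in events if e["status"] >= 400)


def to_metric_data(counts):
    """Shape per-customer counts like CloudWatch MetricData entries."""
    return [
        {
            "MetricName": "ApiErrors",  # hypothetical metric name
            "Dimensions": [{"Name": "CustomerId", "Value": customer_id}],
            "Value": count,
            "Unit": "Count",
        }
        for customer_id, count in sorted(counts.items())
    ]


# Hypothetical request log for one time window.
events = (
    [{"customer_id": "acme", "status": 400}] * 5
    + [{"customer_id": "globex", "status": 500}]
    + [{"customer_id": "initech", "status": 200}] * 3
)

counts = errors_per_customer(events)
total_errors = sum(counts.values())
metric_data = to_metric_data(counts)

print(f"total errors: {total_errors}")
for customer_id, count in counts.most_common():
    print(f"  {customer_id}: {count}")
```

The total of 6 errors looks like a service-wide problem, while the per-customer breakdown shows that 5 of them come from a single caller.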

Use consistent time zones across all systems

Almost every outage investigation starts with the time of event occurrence. This then serves as the key to searching related information across other dashboards, systems, and logs. As a rule of thumb, always opt for UTC references so that you are comparing data and investigations against the same time standard. This is particularly useful when you’re plotting different metrics on the same dashboard and writing logs with timestamps that are later pushed to a centralized logging platform.
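A minimal Python sketch of this rule of thumb follows. It emits timestamps in UTC and normalizes timestamps that arrive with other offsets. The function names and sample timestamps are illustrative only.

```python
from datetime import datetime, timezone


def utc_now_iso():
    """Current time as an ISO 8601 string with an explicit UTC offset."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


def to_utc(timestamp):
    """Convert an ISO 8601 timestamp that carries an offset into UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Guessing the zone of a naive timestamp is how investigations go wrong.
        raise ValueError(f"timestamp has no UTC offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc).isoformat()


# The same instant, as logged by two systems in different time zones.
from_berlin = to_utc("2023-07-02T14:30:00+02:00")
from_new_york = to_utc("2023-07-02T08:30:00-04:00")

print(from_berlin, from_new_york, from_berlin == from_new_york)
print(utc_now_iso())
```

Rejecting timestamps without an offset is a deliberate choice in this sketch: it surfaces the ambiguity at write time rather than in the middle of an outage.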

Propagate trace identifiers

Adopting distributed architectures can mean that a single request flows through multiple systems or microservices before any response can be sent back to the user. Debugging such requests in the absence of related context can be painful. Therefore, it’s a good idea to attach a unique request identifier to the payload as it passes through multiple components. The respective logs of each microservice can then utilize the same trace identifier, along with additional metadata, such as the time it took to process the request.

This greatly simplifies debugging because you can filter on a particular request identifier and immediately see how that request traversed the system. This is highlighted in Figure 8.2:

Figure 8.2 – Injecting trace identifiers in each request
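The flow can be simulated in a few lines of Python. In this sketch, the header name `X-Request-Id`, the service names, and the in-memory `LOG_LINES` list (a stand-in for a centralized logging platform) are all assumptions, and the services are plain function calls rather than real network hops.

```python
import json
import time
import uuid

TRACE_HEADER = "X-Request-Id"  # hypothetical header name
LOG_LINES = []  # stand-in for a centralized logging platform


def log(service, trace_id, message, duration_ms):
    """Write one structured log line that carries the trace identifier."""
    LOG_LINES.append(json.dumps({
        "service": service,
        "trace_id": trace_id,
        "message": message,
        "duration_ms": round(duration_ms, 3),
    }))


def handle(service, headers, downstream=()):
    """Reuse the incoming trace ID (or mint one), call downstream, then log."""
    trace_id = headers.get(TRACE_HEADER) or str(uuid.uuid4())
    outgoing = {**headers, TRACE_HEADER: trace_id}
    start = time.perf_counter()
    if downstream:
        # Pass the same identifier along to the next component.
        handle(downstream[0], outgoing, downstream[1:])
    log(service, trace_id, "request processed", (time.perf_counter() - start) * 1000)
    return trace_id


def find_by_trace(trace_id):
    """Filter the collected logs down to a single request."""
    return [entry for entry in map(json.loads, LOG_LINES) if entry["trace_id"] == trace_id]


first = handle("api-gateway", {}, ("orders", "payments"))
second = handle("api-gateway", {}, ("orders",))

for entry in find_by_trace(first):
    print(entry["service"], entry["duration_ms"])
```

Only the entry point mints an identifier. Every other component reuses the one it receives, which is what makes the final filter return the complete path of a single request.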

In addition to trace identifiers propagating request metadata, it’s also important that all the components of the system make the user aware of their health.
