

Post Production Debugging

Monitoring and Observing Your App Post Release

Pre-release tests are essential, but the ability to debug, monitor and observe your application suite post-release is what allows you to detect, and quickly fix, the production problems that will inevitably arise.


Much has been written about how to ensure quality in the software we write and deploy to production: unit tests, integration tests, Pact and consumer-driven contracts, manual and exploratory testing done by QA teams. The pre-production phase of testing is something I’ve focused on a lot too. I’ve given talks on testing and quality and blogged on Test Doubles, Testing exceptions and various open-source testing libraries including AssertJ, FEST, Hamcrest and EasyMock.

This post however, is essentially about testing and debugging in production.


This is not an original work. I have heavily used several sources, particularly:

And other sources listed in the references section. Those articles were the inspiration for this post, and in some cases I may have simply copied and pasted parts (though I have tried not to). In the worst cases I have bastardized the sources into points the original authors likely never intended. I hope they can forgive me. They say imitation is the greatest form of flattery 🙂 I’ve found the best way for me personally to understand something is to copy, paste, slice, dice and rehash…

Embracing failure

As we use increasingly complex tech stacks, architectures and cloud deployments, the number of things that can go wrong also increases. We live in an era when failure is the norm: a case of when, not if. And things will break in ways you haven’t imagined.

Failures are easiest to deal with (detect, diagnose and recover from) when they are explicit and fatal. A fatal error stops the system. No more insanity or unpredictable behavior. Maybe you’ll get a core dump to assist with debugging. As The Pragmatic Programmer put it: “Dead programs tell no lies! Crash, Don’t Trash.”

Non-fatal and implicit errors are much harder to deal with, however.
Non-fatal errors, where a system continues despite a failure, may cause cascading problems and data corruption. Implicit errors, where a system continues to operate “correctly” but not “well” (e.g. slowly), can be especially difficult to debug and find the root cause for.

A small error in one service cascades and causes catastrophic failures in another. Butterflies cause hurricanes. (See The Hurricane’s Butterfly)

Whatever the error type (implicit, explicit, predicted, or a not-so-pleasant surprise), we need to embrace the failures by designing services to behave gracefully when failure inevitably happens.

Avoid failures when you can, gracefully degrade if you can’t, tolerate the unavoidable, and overall try to build systems to be debuggable (or “observable”) in production for when all hell breaks loose.

Avoiding failures can be as simple as retrying. Graceful degradation can include techniques such as timeouts, circuit breaking (e.g. Hystrix), rate limiting and load shedding. Tolerating failure can include mechanisms such as region failover, eventual consistency (it’s OK if we can’t do it now, we’ll do it eventually) and multi-tiered caching (e.g. if you rely on a data store and it’s down, can you write to a cache as an interim alternative?). (See more at Monitoring in the time of Cloud Native)
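As a trivial illustration of the first of these, a retry with exponential backoff might look something like the following. This is a sketch in Python; the function name, attempt count and delays are my own choices, not from any particular library:

```python
import time

def retry(operation, attempts=3, base_delay=0.1):
    """Call `operation`; on failure, wait with exponential backoff and retry.
    If every attempt fails, re-raise the last exception so the failure is
    explicit rather than silently swallowed."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the failure surface
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

A real implementation would typically also limit which exception types are retried (only transient ones) and add jitter to the delays to avoid thundering herds.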

Quantify the wellness of your app

All of this tolerance however comes at the cost of increased overall complexity, and the corresponding difficulty of reasoning about the system. It also limits how fast new products & features can be developed/released and increases dev costs.

So what should we aim for? Judiciously and proactively decide on trade-offs: the risk the business is willing to accept versus the expense of building ever more available, and concomitantly complex, services. Make a service reliable enough, but no more.

So, how do you define what your goals are? Well, a good place to start is by recognizing and acknowledging where you are! What is your current level of performance? There is little point aiming for “five nines” when you are currently down for hours every day.

And a good place to start to measure your current state, or the wellness of your system, is to use KPIs.


KPIs, or Key Performance Indicators, are basically important metrics about your system.

Some commonly used KPIs are:

  • Number of Users
  • Requests Per Second
  • Response Time
  • Latency

And, the ones we will focus on here:

  • Number of errors
  • Mean Time to Detect and Restore (MTTD/MTTR)
  • Application Performance Index (Apdex)



Error counts (along with code coverage) are among the most common metrics used to monitor software quality. And I get that errors are important and should be dealt with. “No broken windows” (to reference The Pragmatic Programmer again). But error counts may not be the best, and are certainly not the only, metrics for examining the health of your system.

For example, if an error occurs in your infrastructure but there is no user impact, do you care?
There are metrics other than errors to be concerned about, and some, I think, are more important.


MTTD: Mean Time To Detect

MTTR: Mean Time To Restore

How do you record MTTD and MTTR? If you have automated monitoring tools that can detect the downtime and subsequent service restoration, great, but manually recording is also an option. Either way, documenting the times is a key part of an RCA (Root Cause Analysis).

Calculating involves recording three key event times:

  • Problem start time (start)
  • Problem detection time (detect)
  • Problem resolution time (resolve)

And then calculating as follows:

  • MTTD = detect – start
  • MTTR = resolve – start

Note you could instead calculate TTR as “resolve – detect”, as this Splunk article suggests, rather than the “resolve – start” that I am recommending above (and that this NextService article also suggests). Whatever works for you.
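The calculation above can be sketched as follows. This is a minimal Python illustration; the incident-tuple shape and the choice of minutes as the unit are my own:

```python
from datetime import datetime

def mean_times(incidents):
    """Each incident is a (start, detect, resolve) tuple of datetimes.
    Returns (MTTD, MTTR) in minutes, using MTTD = detect - start and
    MTTR = resolve - start, averaged over all incidents."""
    n = len(incidents)
    mttd = sum((d - s).total_seconds() for s, d, r in incidents) / n / 60
    mttr = sum((r - s).total_seconds() for s, d, r in incidents) / n / 60
    return mttd, mttr
```

For a single incident that started at 09:00, was detected at 09:10 and resolved at 09:45, this yields an MTTD of 10 minutes and an MTTR of 45 minutes.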

A final point on MTTR is that it really helps to have CI/CD in place!


Another related metric is MTTF (Mean Time To Failure): on average, how long does the system run before it fails?

Given the choice between tracking MTTF and MTTR, you should track MTTR. Why is MTTR more valuable?

1) Perception.
* If your site is down for 24 hours
* If your site is down for less than a second several times every day
People may never notice the latter, or at least not be inconvenienced by it.
Everyone, including your customers, your CEO, and perhaps your stock price, will notice the former.

2) Wrong Incentives.
The best way to keep a system stable is to never change it! Optimizing for MTTF may therefore incentivize teams to release less often. But releasing less often is the antithesis of the DevOps movement.




Apdex, or Application Performance Index, is a score indicating how certain measurements of performance meet pre-defined targets.

New Relic define Apdex as:

An industry standard to measure users’ satisfaction with the response time of web applications and services. It’s a simplified Service Level Agreement (SLA) solution that gives application owners better insight into how satisfied users are.

Apdex is a measure of response time based against a set threshold. It measures the ratio of satisfactory response times to unsatisfactory response times. The response time is measured from an asset request to completed delivery back to the requestor.

The application owner defines a response time threshold T. All responses handled in T or less time satisfy the user. You can define Apdex T values for each of your New Relic apps, with separate values for APM apps and Browser apps. You can also define individual Apdex T thresholds for key transactions.

For example, if T is 1.2 seconds and a response completes in 0.5 seconds, then the user is satisfied. All responses greater than 1.2 seconds dissatisfy the user. Responses greater than 4.8 seconds frustrate the user.
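Using the standard Apdex formula, Apdex_T = (satisfied + tolerating/2) / total, where “satisfied” means a response within T and “tolerating” means within 4T, the score can be sketched as:

```python
def apdex(response_times, t):
    """Apdex_T = (satisfied + tolerating / 2) / total.
    A response is 'satisfied' if <= T, 'tolerating' if <= 4T,
    and 'frustrated' otherwise. The score ranges from 0 to 1."""
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)
```

So with T = 1.2 seconds, a set of responses of 0.5s (satisfied), 2.0s (tolerating) and 6.0s (frustrated) would score (1 + 0.5) / 3 = 0.5.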


OK, so we have discussed KPIs. SLAs, or Service Level Agreements, are an agreement between a service provider and a client on the acceptable values of those measurements. For example: we will be 99.9% available on average in a given month, or we will refund some portion of your subscription fee. See my post on AWS S3 SLAs for example.

Since SLAs typically deal with some form of compensation if the agreement is not met by the service provider, I don’t think they are particularly suited for use with internal microservices. KPIs may be a better fit there.

One thing to bear in mind about SLAs is that if your service depends on several components, its availability is the product of, and hence always less than, those dependencies’ availabilities.

This can be calculated as follows:

(source: In search of five nines)

Availability Component 1 = Ac1
Availability Component 2 = Ac2
Availability Component n = Acn
Availability System (As) = Ac1 * Ac2 * … * Acn

e.g. if your service depends on AWS Aurora and AWS S3
Availability S3 = 99.9
Availability Aurora = 99.95
Availability System (As)
= 99.9 * 99.95
= 99.85
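The same calculation as a small Python sketch (the function name is my own):

```python
from functools import reduce

def system_availability(*components):
    """Overall availability (as a percentage) of a system that depends on
    every one of `components`, each given as an availability percentage.
    Converts each to a fraction, multiplies them, and converts back."""
    fraction = reduce(lambda acc, a: acc * (a / 100.0), components, 1.0)
    return fraction * 100.0
```

For the S3 plus Aurora example above, system_availability(99.9, 99.95) gives approximately 99.85%.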


OK, great, so you know what KPIs and SLAs are.

What if your KPIs are in fact indicating non performance? Your Apdex is much closer to 0 than 1? Things are slow, erratic, or plain old failing? To use the analogy from earlier, given a hurricane, how do you find the butterfly?

Monitoring and Observability


I think it is safe to say that monitoring (and the logging and alerting it entails) is very mainstream.

Some definitions of monitoring may include:

  • Observe and check the progress or quality of something over a period of time;
  • Keep under systematic review
  • Maintain regular surveillance over

For our services, we often “monitor” something to:

  • confirm it is acting in an expected way, i.e. report the overall health of the system
  • check if it is failing in a specific manner, e.g. Splunk alerts for a specific error message

This ties in with the “explicit” errors we were discussing earlier. You log specific errors, and alert on those specific errors. But monitoring for expected success and expected failures only gets you so far.
What about the unexpected? Maybe you can just create a Splunk query for:

“host=*myservice* error”

But this approach doesn’t work so well when failures become less well defined, more nebulous. When failures become more implicit. Step in Observability…



Observability is about being able to understand how a system is behaving in production. An observable system makes enough data available about itself that you can generate information to answer questions that you had never even thought of when the system was built.

Some definitions of observability that I’ve seen include:

  • Provide highly granular insights into the behavior of systems along with rich context,
  • Provide visibility into implicit failure modes
  • Provide on the fly generation of information required for debugging


Another way to think of observability is monitoring (as described above), plus the ability to debug, understand and analyze dependencies (source: Cindy Sridharan’s Monitoring in the time of Cloud Native):

  • Debugging: debug unexpected, rare and/or implicit failure modes
  • Understanding: using the data to understand our system as it exists today, even during normal, steady state. e.g. How many requests do we receive per day and what is the typical response time?
  • Dependency analysis: understand service dependencies. Is my service being impacted by another service, or worse, contributing to the poor performance of another service?

The Three Pillars of Observability

The three pillars of observability are

  1. Logs
  2. Metrics
  3. Tracing


Start with logging. “Some event happened here!” We log key events as an easy way to track an app’s activity. Logs are a basic and ubiquitous aspect of software, so I am not going to cover them much here.


How many of these events happened? Metrics are simply measurements of something: data aggregated from events (“statistical aggregates”) that can be used to identify trends.

You can record metrics for almost anything you are interested in, but the four golden signals (à la the Google SRE book) are: latency, saturation, traffic, and errors.
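As a toy illustration of aggregating two of those signals (latency and errors), with the class shape and percentile choice being my own assumptions rather than any standard API:

```python
import math

class RequestMetrics:
    """A toy in-memory aggregation of two golden signals: latency and errors.
    Real systems would use a metrics library and time-windowed aggregation."""

    def __init__(self):
        self.latencies_ms = []  # one entry per request
        self.errors = 0

    def record(self, latency_ms, ok=True):
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def error_rate(self):
        """Fraction of recorded requests that failed."""
        return self.errors / len(self.latencies_ms)

    def p95_latency(self):
        """95th-percentile latency: the value below which 95% of requests fall."""
        ordered = sorted(self.latencies_ms)
        return ordered[math.ceil(0.95 * len(ordered)) - 1]
```

Percentiles (p95, p99) are usually more informative than averages here, because a healthy-looking mean can hide a long tail of slow requests.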


Track the impact of events: request-scoped causal information. Did the event cause upstream or downstream impact?

Tracing is about tracking a request through a system: for example, from the web server, through various microservices, to your database, and back. As well as tracking the request path, however, a trace can also provide visibility into the request structure. By that I mean you can track the bifurcation/forking and asynchronicity inherent in multi-threaded environments.

Tracing is often supported through the use of Correlation IDs (also known as Transit IDs): random identifiers (e.g. GUIDs) generated at the entry point to (or as early as possible in) a distributed system, and passed through each layer as a way of linking the multiple calls that constitute the lifecycle of a request. Correlation IDs are an integral part of microservices. As Sam Newman, author of the excellent Building Microservices, put it:

Tracing can also be provided by some APMs. For example transaction traces in New Relic.
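A minimal sketch of the correlation-ID pattern: reuse an incoming ID if one is present, otherwise mint one at the entry point. The header name here is a common convention, but your framework or gateway may use a different one:

```python
import uuid

HEADER = "X-Correlation-ID"  # a common, but not universal, header name

def ensure_correlation_id(headers):
    """At the system's entry point: reuse the incoming correlation ID if
    present, otherwise generate a fresh GUID. The same headers dict is then
    passed on to every downstream call, and the ID is included in every log
    line, so one request can be stitched together across services."""
    if HEADER not in headers:
        headers[HEADER] = str(uuid.uuid4())
    return headers[HEADER]
```

The key property is idempotence: once an ID exists, every hop propagates the same value rather than generating its own.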


Finally, I like this diagram from Peter Bourgon that shows the relationship between the three pillars of Observability:



Monitoring vs Observability

You can think of monitoring as being more proactive. You write code that logs specific messages and errors, and create alerts around those messages and errors. At service deployment time for example, it is not unusual to keep an eye on the logs for the messages that reflect the system is acting as expected, or for the “known” errors.

Observability is more reactive. You “observe” the system in production, trying to debug and understand it, particularly when things go wrong.

According to Charity Majors, CEO at @honeycombio,

Observability is about the unknown unknowns. A system is observable when you can ask any arbitrary question about it and dive deep, explore, follow breadcrumbs.

And while monitoring and observability have many things in common, the nuanced terminology can be very useful for distinguishing use cases, and maturity. Or not:

I personally think the terminology is useful, but that mastering the tools needed is more important. It doesn’t matter what you call it, but when you are dealing with a production outage, being able to quickly detect, diagnose and resolve is critical.

And on that point, the following tools are on my reading list:

  • Zipkin
  • Prometheus

And I need to further develop my skills in

  • Splunk
  • New Relic


Sources and References
