Live #1 - Monitoring - Is it a Feature or Ops?

I have been working on optimising and fixing software applications in production for well over a decade now, and I thought I would start writing down some of the practices that I believe align with known engineering principles. Let's call this series #Live, meaning production systems. As the first post in this series, we will look into the monitoring strategy for production applications.


The question I try to answer in this post is:

Should monitoring be developed as a feature, or is it an operations activity after development? As an architect, what can you do to make sure the monitoring design is right?

TL;DR

1. Setting up monitors is a developer activity, but it should also keep DevOps in mind; I would say it should be built into the organisation's DevOps strategy. An architect should work on a strategy where the monitoring infrastructure reports metrics and failures from code, and DevOps is onboarded to set up monitors by extending it. There should be a combination of code-level monitoring and process-level monitoring. The ultimate idea is to make sure that for every feature developed there is a way to monitor it reliably.

2. Monitoring should be running on developer machines and in lower environments, so it can be validated during the development life cycle and shown to be as robust as the feature delivered. In that sense it's a feature built into the story / epic you are delivering to the customer.

Details

I will go through my detailed thought process behind the above summary. Let's start with why seeing monitoring as a feature helps, then look at what it means to develop monitoring as a feature and what some of the options for .NET are.

Why it should be a feature and built into code

In production there are any number of things to monitor. Some of them are:

Application

  • Framework-level metrics (ASP.NET, Kestrel, NHibernate, heap), code instrumentation, performance profiling, logs & errors

Dependencies

  • Database, Caching, Searching, Scheduling

Infrastructure

  • CPU, memory, disk space, etc.

Our focus is on none of these. All of the above should exist in some form or other. There are enough tools available to monitor them, and they do a fairly good job. The problem is that they tell you what the symptoms are without telling you what those symptoms mean to the user, because they report all the metrics from outside the process. Some instrument at the runtime level and report from there, but that still isn't enough to tell the story from the customer's point of view. An example scenario:

Alert: Memory usage is 90% on server us-raven-p01.domain.com

What is the impact on the customer? Is this going to affect page load time? Is it going to delay a notification the application is supposed to send? Or something else entirely?

  • To understand whether the application is really doing the things it is supposed to do, we need to report data from inside the system. An example would be the % of successful requests handled per hour. One can argue that this can be reported at the web server level, but that is an inference, not a fact from the customer's point of view. Only the application can know what success means.
  • Inference-based monitoring becomes obsolete over time: the application's deployment configuration changes and the application itself changes, but the monitoring doesn't, and we carry the overhead of setting up / configuring the agents and depending on external teams to make it work again.
  • The knowledge of the monitors stays separate from the knowledge of the code, so the monitors do not evolve in parallel with the code.
  • The monitoring infrastructure should be part of the developer setup, and the developer should be able to set up and review the monitors as part of development. This helps establish the reliability of the monitoring and allows monitoring even in lower environments with no extra effort (see the sketch after this list).
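
One way to make that practical is to hide metric reporting behind a small abstraction and swap the backend per environment, so the same monitoring code runs on a developer machine against a console sink and in production against the APM agent. A minimal sketch; IMetricsReporter, ConsoleMetricsReporter and ApmMetricsReporter are hypothetical names, not taken from any particular SDK:

```csharp
using System;

// Hypothetical abstraction: feature code reports its KPIs through this interface,
// and the environment decides where the numbers actually go.
public interface IMetricsReporter
{
    void Report(string name, double value);
}

// Developer machines / lower environments: write to the console (or a log file),
// so the monitors can be reviewed while the feature is still being built.
public sealed class ConsoleMetricsReporter : IMetricsReporter
{
    public void Report(string name, double value) =>
        Console.WriteLine($"[metric] {name} = {value}");
}

// Production: forward to whichever APM SDK the organisation uses.
public sealed class ApmMetricsReporter : IMetricsReporter
{
    public void Report(string name, double value)
    {
        // The vendor's custom metric API would be called here,
        // e.g. the New Relic agent's RecordMetric.
    }
}
```

With ASP.NET Core dependency injection you would register ConsoleMetricsReporter in Development and ApmMetricsReporter in Production, so the monitoring code is exercised from the very first local run.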

Reporting from code tells a different story than the inferences. For example, you can report the exact % of successful requests based on exceptions, the success % of database inserts, the number of exceptions raised before a request completes, a confidence % on customer requests per month, and so on. These are customer-centric metrics rather than infrastructure-centric ones.
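
As a concrete illustration of the first metric, the success % of requests can be computed where the application actually knows the outcome. A rough sketch as ASP.NET Core middleware, reusing the hypothetical IMetricsReporter from the earlier sketch; the metric name is made up, and what counts as "success" is deliberately the application's own definition:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Counts requests and the ones that complete without an unhandled exception or 5xx,
// then reports the running success % so it can be charted and alerted on.
public sealed class RequestSuccessRateMiddleware
{
    private readonly RequestDelegate _next;
    private readonly IMetricsReporter _metrics;
    private long _total;
    private long _succeeded;

    public RequestSuccessRateMiddleware(RequestDelegate next, IMetricsReporter metrics)
    {
        _next = next;
        _metrics = metrics;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        Interlocked.Increment(ref _total);
        try
        {
            await _next(context);
            if (context.Response.StatusCode < 500)
                Interlocked.Increment(ref _succeeded);
        }
        finally
        {
            long total = Interlocked.Read(ref _total);
            long succeeded = Interlocked.Read(ref _succeeded);
            // Hypothetical metric name; in practice you would report this on a timer
            // (e.g. once a minute) rather than on every request.
            _metrics.Report("Custom/Requests/SuccessPercent", 100.0 * succeeded / total);
        }
    }
}
```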

Infrastructure-centric monitoring doesn't make monitoring complete on its own. We have to complement it with feature-centric metrics, hereafter called KPIs.

How do we develop monitoring as a feature

Server

Nowadays pretty much all APM tools offer an SDK to report

  • Custom Metrics
  • Custom Attributes
  • Custom Events

to the APM server, with visualisation support to see this data alongside the other metrics. For example, you can chart a custom metric against system memory. You should work with your APM vendor to set up custom monitoring. I have used New Relic extensively and I have a GitHub project showing how it works.
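
As a rough illustration of those three hooks with the New Relic .NET agent API (the exact namespaces and method names depend on the agent version you install, so treat this as a sketch rather than a reference; OrderKpis and the metric/event names are made up):

```csharp
using System.Collections.Generic;

public static class OrderKpis
{
    public static void ReportOrderPlaced(decimal amount, bool paymentSucceeded)
    {
        // Custom metric: aggregated on the APM side, can be charted and alerted on.
        NewRelic.Api.Agent.NewRelic.RecordMetric("Custom/Orders/Placed", 1);

        // Custom event: a discrete record with attributes, queryable later.
        NewRelic.Api.Agent.NewRelic.RecordCustomEvent("OrderPlaced",
            new Dictionary<string, object>
            {
                { "amount", amount },
                { "paymentSucceeded", paymentSucceeded }
            });

        // Custom attribute: attached to the current transaction trace
        // (available in the newer agent API versions).
        var agent = NewRelic.Api.Agent.NewRelic.GetAgent();
        agent.CurrentTransaction.AddCustomAttribute("orderAmount", amount);
    }
}
```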