HootSuite: Measure Everything
Many of these graphs are displayed on large TV screens in the HootSuite office. (Not in the lobby where the public would see them, of course, but in the internal parts of the office.)
As the adage goes, a picture is worth a thousand words, so here are some of those reports.
APIs, Performance & Exceptions
The ops team at HootSuite uses a number of Graphite reports. The one above is currently one of their favorites.
Since HootSuite interacts with a number of third-party APIs (such as Twitter, Facebook, Google+, Klout, and Foursquare), it's important to monitor the "health" of these APIs.
In addition, page loads and page execution times in the HootSuite web application are measured, as are the various exceptions that the HootSuite application code may throw.
This report, with these graphs, lets ops (and others who look at the screen) get a feel, at a glance, for (some of) "what's happening".
And when you look at this graph often, you get a feel for what normal looks like, and you can tell when something unusual or bad is happening.
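As a rough illustration of how a measurement like this might reach Graphite, here is a minimal sketch that formats a data point in Graphite's plaintext protocol (one line of `<path> <value> <timestamp>`) and sends it to a Carbon listener. The metric name, the host `graphite.example.com`, and the helper names are assumptions made for the example; this is not HootSuite's actual code.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="graphite.example.com", port=2003,
                timestamp=None):
    """Send a single data point to a Carbon listener over TCP.

    The host is a placeholder; 2003 is Carbon's usual plaintext port.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


def time_call(func, *args, **kwargs):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

A page handler could be wrapped with `time_call` and the elapsed time passed to `send_metric` under a hypothetical name such as `webapp.page.execution_ms`.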
Computer & Operating System Level Stats
This report contains graphs about computer-level and operating system-level stats.
For example, it covers disks, CPUs, memory, networks, and load average.
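To give a sense of where numbers like these can come from, here is a small sketch that gathers a few of them using only the Python standard library. The metric prefix `servers.web01` and the function name are made up for the example, and a real collector would typically gather far more than this.

```python
import os
import shutil


def collect_system_stats(prefix="servers.web01", disk_path="/"):
    """Return a dict of metric name -> value for a few OS-level stats.

    The prefix is a hypothetical naming scheme, not a Graphite requirement.
    """
    stats = {}

    # Disk usage for one mount point, as a percentage of total space.
    usage = shutil.disk_usage(disk_path)
    stats[f"{prefix}.disk.used_percent"] = round(
        100.0 * usage.used / usage.total, 2
    )

    # Number of CPUs visible to the operating system.
    stats[f"{prefix}.cpu.count"] = os.cpu_count() or 0

    # Load averages are only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            one, five, fifteen = os.getloadavg()
        except OSError:
            pass
        else:
            stats[f"{prefix}.load.1min"] = one
            stats[f"{prefix}.load.5min"] = five
            stats[f"{prefix}.load.15min"] = fifteen

    return stats
```

Each entry in the returned dict could then be recorded as one data point, stamped with the time of collection.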
Comparing Days Against Each Other
This type of graph compares a single stat against itself over a week's worth of time, with each day overlaid on the same axes.
For example, in the graph above, Twitter API usage is shown.
Here is another similar graph, this one for Facebook API usage.
And yet another one for page load activity:
This overlap technique can be tweaked. Perhaps the day of the week matters: in one graph you want to compare the last 5 Mondays against each other, in another the last 5 Tuesdays, and so on.
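One way to build overlays like these is with Graphite's `timeShift` function, which draws a series offset by a given period. The sketch below generates the targets and a render URL; the metric name and base URL are placeholders, and whether HootSuite's graphs were built this way is an assumption.

```python
from urllib.parse import urlencode


def overlay_targets(metric, periods, step_days=1):
    """Build Graphite targets that overlay a metric on shifted copies of itself.

    periods is the total number of lines drawn, including the current one.
    step_days=1 compares consecutive days; step_days=7 compares the same
    weekday across consecutive weeks.
    """
    targets = [f'alias({metric},"current")']
    for n in range(1, periods):
        shift = n * step_days
        targets.append(
            f'alias(timeShift({metric},"{shift}d"),"{shift}d ago")'
        )
    return targets


def render_url(base_url, metric, periods=7, step_days=1, window="-24h"):
    """Build a render URL showing the last window of data with overlays."""
    params = [("target", t) for t in overlay_targets(metric, periods, step_days)]
    params.append(("from", window))
    params.append(("format", "png"))
    return f"{base_url}/render?{urlencode(params)}"
```

With `periods=7` and `step_days=1` this gives a week of days on one graph. With `periods=5` and `step_days=7` it compares the same 24-hour window across five consecutive weeks, so the lines are Mondays only if the viewed window itself falls on a Monday.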
Yet More Graphs
A lot can be gained by recording statistics like this, creating these kinds of graphs, and making them amenable to passive inspection by putting them on large-screen TVs in the office. But there is more that can be done.
Although having these kinds of reports on screens where everyone can see them helps a lot, people aren't paying attention to them every second of the day. On top of that, people aren't in the office (or even awake) every second of the day to watch these screens.
What helps is having automated systems that monitor these statistics, have a sense of what normal looks like, and can automatically detect when something outside the normal range happens.
Such a system could simply send an alert. (Perhaps hooking into Nagios or simply sending an e-mail.) In addition to this, in some cases, it could try to deal with or fix the problem it detects.
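As a very simple sketch of what "a sense of what normal looks like" could mean in code, the example below flags a value that lies more than a few standard deviations from the mean of recent history. This is one basic technique chosen for illustration, not a description of any system HootSuite runs; the function names, the threshold, and the `alert` callback are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if latest is more than threshold standard deviations
    away from the mean of history."""
    if len(history) < 2:
        # Not enough data to estimate what normal looks like.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def check_and_alert(name, history, latest, alert=print, threshold=3.0):
    """Call alert with a message if the latest value looks abnormal.

    alert is any callable taking a string: it could send an e-mail or
    hand the message to a monitoring system.
    """
    if is_anomalous(history, latest, threshold):
        alert(f"{name}: latest value {latest} is outside the normal range")
        return True
    return False
```

A plain mean and standard deviation ignore daily and weekly cycles, so a practical detector would probably compare against the same time of day or day of the week, in the spirit of the overlay graphs above.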
On "Measuring Everything"
Now, I should say that recording stats and displaying them in graphs is not a new thing. Back in 2004, I had this type of thing. (I'd be surprised if systems like this didn't exist before that date.) Back then, though, we created the system ourselves, from scratch. Now there are existing free and open source tools that save you the trouble of having to roll your own solution.