Infra Uptime Calculations

SYSTEM UPTIME / AVAILABILITY CALCULATIONS

Reliability: the probability that the product (system) will perform its intended function for a specified time period when operating under normal (or stated) environmental conditions.
Availability: the probability that a server will perform its intended function under normal operating conditions when needed, usually expressed as a percentage.

Quantitative Calculations of Availability

Ignoring the possibility of degraded states for now, let’s look at availability as if it were equivalent to system uptime. This is usually expressed as the number of nines in the uptime percentage. For example, five nines equates to 99.999% uptime, which is approximately 5 minutes and 15 seconds of downtime per year!
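
To see where a figure like that comes from, here is a minimal Python sketch that converts an availability target into a downtime budget. The function name and the 365-day year are my own assumptions for illustration, not part of any standard tool.

```python
# Minutes in a non-leap year (assumption: 365 days).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, for an availability target.

    Hypothetical helper: availability_percent is a value such as 99.999,
    period_minutes is the length of the measurement window.
    """
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Two through five nines, measured over one year.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

For five nines this works out to roughly 5.26 minutes per year, which is where the 5 minutes and 15 seconds comes from.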

There are a number of simple ways to check how long a system has been up, such as Microsoft’s uptime.exe tool, running systeminfo.exe at a command prompt, or just looking at Task Manager on Vista or Server 2008. However, to track uptime percentages, what you really need to do is track the amount of downtime within a specific period.

The easiest way to calculate your uptime or “Nines” rating is as follows:

Availability = (Uptime / (Uptime + Downtime)) x 100

OR

Availability = ((Elapsed Time – Downtime) / Elapsed Time) x 100

For example, in the last 30 days (43,200 minutes) I have a system that has been online and available for 29 days, 23 hours and 34 minutes (43,174 minutes). In other words, it has been offline for 26 minutes within a 30-day (43,200 minute) period due to an unscheduled outage:

Elapsed Time (Minutes in Month) = 43,200

Downtime Minutes = 26

Availability = (43,174 / (43,174 + 26)) x 100 = 99.94% (roughly 99.9%)

or

Availability = ((43,200 – 26) / 43,200) x 100 = 99.94% (roughly 99.9%)
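
If you would rather script it, the elapsed-time version of the formula can be sketched in a few lines of Python. The function and variable names are hypothetical, chosen just for this example, and the input checks are my own addition.

```python
def availability_percent(elapsed_minutes, downtime_minutes):
    """Availability = ((Elapsed Time - Downtime) / Elapsed Time) x 100.

    Hypothetical helper; both arguments are in minutes.
    """
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    if not 0 <= downtime_minutes <= elapsed_minutes:
        raise ValueError("downtime_minutes must be between 0 and elapsed_minutes")
    return (elapsed_minutes - downtime_minutes) / elapsed_minutes * 100


if __name__ == "__main__":
    elapsed = 30 * 24 * 60  # 30 days = 43,200 minutes
    downtime = 26           # the unscheduled outage from the example
    print(f"Availability: {availability_percent(elapsed, downtime):.2f}%")
```

With the numbers from the example this gives about 99.94%, so the system clears three nines but is well short of four.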

Alternatively, you can just plug your numbers in to this handy uptime calculator.

Qualitative Calculations of Availability

Quantifying uptime, although still a chore, is fine when viewing uptime as a binary state in which you have only two possible scenarios: up or down. But how do you determine the availability of a sluggish email server, or a SQL Server cluster that keeps failing back and forth between nodes? This is where it gets subjective, and since every system is different, I don’t think there are any set rules or standards. The best course of action is probably to obtain service baselines that describe a server’s normal quality of service, so that a degraded service can be measured against that baseline.

For example, say we have a system that normally handles 100 transactions per second, but in the last 30 days there was a 24-hour period where it was only handling 25 transactions per second. A reboot (which completed in 5 minutes) resolved the problem. In this scenario, I think it is fair to assume the system was 75% degraded, or 25% available, until the reboot. The uptime calculation for the month may be as follows:

Availability = ((Elapsed Time – (Degraded Time x Degraded %) – Downtime) / Elapsed Time) x 100

Availability = ((43,200 – (1,440 x 0.75) – 5) / 43,200) x 100 = 97.5%
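
Here is a rough Python sketch of the same degraded-state calculation. It assumes, as the example does, that degradation can be boiled down to a single fraction of lost capacity measured against a throughput baseline; the function and parameter names are hypothetical.

```python
def degraded_availability_percent(elapsed_minutes, degraded_minutes,
                                  degraded_fraction, downtime_minutes):
    """Availability with degraded time weighted by how degraded it was.

    Hypothetical helper. degraded_fraction is the share of capacity lost
    during the degraded period (0.75 means 75% degraded, 25% available).
    """
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    if not 0 <= degraded_fraction <= 1:
        raise ValueError("degraded_fraction must be between 0 and 1")
    # Degraded time counts as partial downtime; hard downtime counts in full.
    lost_minutes = degraded_minutes * degraded_fraction + downtime_minutes
    return (elapsed_minutes - lost_minutes) / elapsed_minutes * 100


if __name__ == "__main__":
    baseline_tps = 100  # normal transactions per second
    observed_tps = 25   # throughput during the degraded period
    degraded_fraction = 1 - observed_tps / baseline_tps  # 0.75

    result = degraded_availability_percent(
        elapsed_minutes=30 * 24 * 60,  # 43,200
        degraded_minutes=24 * 60,      # 1,440
        degraded_fraction=degraded_fraction,
        downtime_minutes=5,            # the reboot
    )
    print(f"Availability: {result:.2f}%")
```

Deriving the degraded fraction from a baseline like this keeps the subjective part in one place: if you and your management agree on the baseline, the rest is just arithmetic.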

In the end, the real difficulty is not the calculation but quantifying the degraded state.

Cost vs. Availability

What continually amazes me, however, is the number of woefully underfunded IT departments with specific, yet unobtainable, uptime metrics. The unfortunate Systems Administrators in these shops are doomed to failure, or at least to many sleepless nights. I’m not saying that 3, 4 or even 5 nines is impossible; it just requires an incredibly robust infrastructure and can be heavily influenced by “how” uptime is calculated. You must weigh the cost of downtime against the cost of building out the network and server infrastructure for high availability. In other words, as expected downtime decreases, costs increase dramatically, if not exponentially.

For a fantastic article describing the difficulties of achieving high nines, you may want to read (or point your CTO to) “Five Nines: Chasing the Dream?” in order to better ground them in uptime reality. In the end, you have to come to an agreement with IT management on what constitutes “Downtime” so that you can properly calculate your uptime.

Posted May 17, 2014 by g6237118
