## Availability

Data Centre Infrastructure

### 1 Reliability

Any system, operated long enough, will experience failures. We define the reliability $R(t)$, the probability that the system has not failed after operating for a time $t$ since it started, to be:

$R(t) = e^{-\lambda t} \qquad \text{(1)}$

Equation 1 shows that the longer a particular system runs, the lower the likelihood that it has not yet experienced a failure. How quickly this likelihood falls is set by the parameter $\lambda$, which we call the failure rate.

Figure 1: Reliability as a function of time
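
To make Equation 1 concrete, the following is a minimal Python sketch that evaluates $R(t)$ at a few times. The function name `reliability` and the value $\lambda = 0.25$ per year are illustrative assumptions, not part of the original text.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Return R(t) = exp(-failure_rate * t).

    failure_rate and t must use consistent units (e.g. per year and years).
    """
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


# Illustrative values only: lambda = 0.25 per year, times in years.
for years in (0, 1, 4, 8):
    print(f"R({years}) = {reliability(0.25, years):.3f}")
# R(0) = 1.000
# R(1) = 0.779
# R(4) = 0.368
# R(8) = 0.135
```

Note that at $t = 1/\lambda$ (here 4 years) the reliability has fallen to $e^{-1} \approx 0.368$.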

#### 1.1 Failure rate

If a system experiences ${N}_{f}$ failures over a total observation time of ${T}_{p}$, we define the failure rate to be:

$\lambda = \frac{N_f}{T_p} \qquad \text{(2)}$

This means that $\lambda$ has units of inverse time.

Example 1 (Failure rate). A system’s manager recorded 1 failure in a 4 year period. What is the system’s failure rate per year?

$\begin{array}{rll} \lambda & = \frac{N_f}{T_p} & \qquad \text{(3)} \\ & = \frac{1}{4\ \text{year}} & \qquad \text{(4)} \\ & = 0.25\ \text{year}^{-1} & \qquad \text{(5)} \end{array}$
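
The same calculation can be sketched in Python. The function name `failure_rate` is a hypothetical helper introduced for illustration; the result is expressed in the inverse of whatever time unit is used for the observation period.

```python
def failure_rate(n_failures: float, observation_time: float) -> float:
    """Return lambda = N_f / T_p, in units of 1 / (unit of observation_time)."""
    if observation_time <= 0:
        raise ValueError("observation_time must be positive")
    return n_failures / observation_time


# Example 1: 1 failure observed over 4 years.
rate_per_year = failure_rate(1, 4)
print(rate_per_year)  # 0.25 (per year)
```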

### 2 Bathtub curve

In practice, the failure rate is elevated at the beginning and end of a system’s life, following the so-called bathtub curve, Figure 2. The failure rate quoted for a system normally excludes these periods and can be assumed to be constant for that system.

Figure 2: Bathtub curve

### 3 Mean time between failures

The Mean Time Between Failures (MTBF) is the reciprocal of the failure rate, and has units of time:

$\text{MTBF} = \frac{1}{\lambda} \qquad \text{(6)}$

It is generally accepted that the mean time between failures relates only to the middle portion of the so-called “bathtub curve” of reliability, section 2.

Example 2 (Mean time between failures). A system on average fails on two occasions in five years. Calculate the mean time between failures.

$\begin{array}{rll} \text{MTBF} & = \frac{1}{\lambda} & \qquad \text{(7)} \\ & = \frac{1}{\frac{N_f}{T_p}} & \qquad \text{(8)} \\ & = \frac{T_p}{N_f} & \qquad \text{(9)} \\ & = \frac{5\ \text{year}}{2} & \qquad \text{(10)} \\ & = 2.5\ \text{year} & \qquad \text{(11)} \end{array}$
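
As a quick check of Example 2, here is a minimal Python sketch that computes the MTBF both from the failure rate and directly from the observed counts. The function names are hypothetical and introduced only for illustration.

```python
def mtbf_from_rate(failure_rate: float) -> float:
    """Return MTBF = 1 / lambda."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


def mtbf_from_counts(n_failures: float, observation_time: float) -> float:
    """Return MTBF = T_p / N_f, in the same unit as observation_time."""
    if n_failures <= 0:
        raise ValueError("n_failures must be positive")
    return observation_time / n_failures


# Example 2: two failures in five years.
print(mtbf_from_counts(2, 5))  # 2.5 (years)
```

Both routes give the same answer, since dividing by $\lambda = N_f / T_p$ is the same as computing $T_p / N_f$.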

### 4 Mean time to repair

Once a failure has occurred, a repair is normally required, during which the system is unavailable. The average repair time for a particular type of failure is called its Mean Time To Repair (MTTR).

### 5 Inherent availability

The inherent availability of a system tells us for what percentage of time it is likely to be available. It is based on two ideas:

1. The system is available between failures, which occur on average at intervals of the MTBF.
2. When a failure occurs, the system is unavailable for the time taken to repair it, i.e. the MTTR.

So, the availability is essentially determined by how often a repair is needed and how long it takes. Knowing the MTBF and MTTR, we can estimate the inherent availability ${A}_{i}$ of a system:

$A_i = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \qquad \text{(12)}$

Example 3 (Inherent availability). A system has an MTBF of 24 days and an MTTR of 12 hours. Calculate the inherent availability. Both quantities must be in the same unit; we take a base unit of days, so that 12 hours = 0.5 days.

$\begin{array}{rll} A_i & = \frac{24}{24 + 0.5} & \qquad \text{(13)} \\ & \approx 98\ \% & \qquad \text{(14)} \end{array}$
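
Example 3 can be sketched in Python as follows, including the unit conversion from hours to days. The function name `inherent_availability` and the constant `HOURS_PER_DAY` are hypothetical names introduced for illustration.

```python
HOURS_PER_DAY = 24


def inherent_availability(mtbf: float, mttr: float) -> float:
    """Return A_i = MTBF / (MTBF + MTTR); both arguments must share one time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


# Example 3: MTBF of 24 days, MTTR of 12 hours converted to days.
mttr_days = 12 / HOURS_PER_DAY
a_i = inherent_availability(24, mttr_days)
print(f"{a_i:.2%}")  # 97.96%
```

Carried to two decimal places the result is 97.96 %, which rounds to the 98 % quoted in the example.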

### 6 Operational availability

The operational availability of a system extends the inherent availability to incorporate scheduled maintenance downtime.
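
The text does not give a formula for operational availability. One common way to express it, assumed here, is the fraction of total time the system is up, where downtime counts both repairs and scheduled maintenance. The function name and all numbers below are hypothetical and serve only as an illustration.

```python
def operational_availability(
    uptime: float,
    repair_downtime: float,
    scheduled_downtime: float,
) -> float:
    """Return uptime / (uptime + all downtime), using one common time unit.

    Assumption: downtime is the sum of unscheduled repair time and
    scheduled maintenance time; other delays are ignored in this sketch.
    """
    total = uptime + repair_downtime + scheduled_downtime
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime / total


# Hypothetical year: 355 days up, 7 days of repairs, 3 days of planned maintenance.
a_o = operational_availability(355, 7, 3)
print(f"{a_o:.2%}")  # 97.26%
```

With the scheduled downtime set to zero, this reduces to the same uptime-over-total-time ratio that underlies the inherent availability.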

### 7 N

To provide a particular function, we need a particular number of units, which we denote $N$.

### 8 Common patterns

#### 8.1 N+1

If we need $N$ units, we add an additional unit to cover failures. The additional unit is denoted $+1$. This gives us the designation $N+1$.

#### 8.2 2N

If we need $N$ units, we duplicate each individual unit, giving us the designation $2N$.

Note that in practice the $2N$ and $N+1$ configurations sometimes lead to the same number of units (for example when $N = 1$), but one designation is usually preferred over the other to convey meaning.
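
The unit counts for the two patterns can be compared with a short Python sketch. The function names are hypothetical, and the values of $N$ are chosen only for illustration.

```python
def units_n_plus_1(n: int) -> int:
    """Total units installed under N+1: the N required plus one spare."""
    return n + 1


def units_2n(n: int) -> int:
    """Total units installed under 2N: every required unit is duplicated."""
    return 2 * n


for n in (1, 2, 4):
    print(f"N={n}: N+1 -> {units_n_plus_1(n)} units, 2N -> {units_2n(n)} units")
# N=1: N+1 -> 2 units, 2N -> 2 units
# N=2: N+1 -> 3 units, 2N -> 4 units
# N=4: N+1 -> 5 units, 2N -> 8 units
```

The two counts coincide only at $N = 1$; for larger $N$ the $2N$ pattern always installs more units than $N+1$.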