Data Centre Infrastructure

Any system, operated long enough, will experience failures. We define the reliability $R$ as a function of the time $t$ for which the system has been operating since it was started:

$$R(t) = e^{-\lambda t} \tag{1}$$

Equation 1 says that the longer a particular system runs, the lower the likelihood that it has not yet experienced a failure. How quickly that likelihood falls is set by the parameter $\lambda$, which we call the failure rate.
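
As a minimal sketch of Equation 1, the following Python function evaluates $R(t)$ for a given failure rate. The function name `reliability` and the figure of one failure per 10,000 hours are illustrative assumptions, not values taken from this text.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Return R(t) = exp(-failure_rate * t), the probability of no failure up to time t.

    failure_rate and t must use matching units (e.g. per hour and hours).
    """
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


# Hypothetical figure: on average one failure per 10,000 operating hours.
failure_rate = 1 / 10_000  # failures per hour

for hours in (0, 1_000, 5_000):
    print(f"R({hours} h) = {reliability(failure_rate, hours):.3f}")
```

With these assumed numbers the reliability starts at 1 and has fallen to roughly 0.9 after 1,000 hours, which shows the steady exponential decline the equation describes.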

If a system experiences, on average, ${N}_{f}$ failures over a total observation time of ${T}_{p}$, we define the failure rate to be:

$$\lambda = \frac{N_f}{T_p} \tag{2}$$

This means that $\lambda$ has units of inverse time.
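
Equation 2 can be sketched in a few lines of Python. The function name `failure_rate` and the observation figures (4 failures over 35,040 operating hours) are hypothetical, chosen only to make the arithmetic concrete.

```python
def failure_rate(n_failures: int, total_time: float) -> float:
    """Return lambda = N_f / T_p, in failures per unit of total_time."""
    if n_failures < 0:
        raise ValueError("n_failures must be non-negative")
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return n_failures / total_time


# Hypothetical observation: 4 failures over 35,040 operating hours.
lam = failure_rate(4, 35_040)
print(f"failure rate = {lam:.3e} per hour")
```

Because the result is expressed per unit of `total_time`, the caller decides the unit: passing hours gives failures per hour, passing years gives failures per year.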

In practice, the failure rate is elevated at the beginning and end of a system’s life, following the so-called bathtub curve (section 2). The failure rate quoted for a system normally excludes these two periods, so it can be assumed to be constant for that system.

The Mean Time Between Failures (MTBF) is the reciprocal of the failure rate, and has units of time:

$$\text{MTBF} = \frac{1}{\lambda} \tag{6}$$

It is generally accepted that the mean time between failures relates only to the middle portion of the so-called “bathtub curve” of reliability (section 2).
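
The sketch below applies Equation 6 and then substitutes the result back into Equation 1. The function name `mtbf` and the input figures are illustrative assumptions.

```python
import math


def mtbf(failure_rate: float) -> float:
    """Return MTBF = 1 / lambda, in the time unit that failure_rate is expressed per."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


# Hypothetical failure rate: 4 failures over 35,040 operating hours.
lam = 4 / 35_040            # failures per hour
mean_time = mtbf(lam)       # hours between failures, on average

# Reliability at t = MTBF, using R(t) = exp(-lambda * t) from Equation 1.
survival_at_mtbf = math.exp(-lam * mean_time)

print(f"MTBF = {mean_time:.0f} hours")
print(f"R(MTBF) = {survival_at_mtbf:.3f}")
```

Under the constant-failure-rate model, $R(\text{MTBF}) = e^{-1} \approx 0.37$ whatever the value of $\lambda$. The MTBF is therefore an average interval between failures, not a period during which the system can be expected to run without one.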

Once a failure has occurred, it normally requires a repair time, during which the system is unavailable. Averaging over many repairs, we say that a particular type of failure has a Mean Time To Repair (MTTR).

The inherent availability of a system tells us for what percentage of time it is likely to be available. It is based on two ideas:

- The system is available between failures, which occur on average at intervals of the MTBF.
- When a failure occurs, the system is unavailable for the time taken to repair it, i.e. the MTTR.

The availability is therefore determined by how often a repair is needed and how long it takes. Knowing the MTBF and MTTR, we can estimate the inherent availability, ${A}_{i}$, of a system:

$$A_i = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \tag{12}$$
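
As a rough illustration of Equation 12, the following sketch computes the inherent availability and converts it into an expected downtime per year. The function name `inherent_availability`, the constant `HOURS_PER_YEAR`, and the MTBF and MTTR figures are assumptions made for the example.

```python
HOURS_PER_YEAR = 8_760  # 365 days x 24 hours, ignoring leap years


def inherent_availability(mtbf: float, mttr: float) -> float:
    """Return A_i = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    if mtbf <= 0:
        raise ValueError("mtbf must be positive")
    if mttr < 0:
        raise ValueError("mttr must be non-negative")
    return mtbf / (mtbf + mttr)


# Hypothetical unit: one failure per year on average, four hours to repair.
a_i = inherent_availability(mtbf=8_760, mttr=4)
downtime_hours = (1 - a_i) * HOURS_PER_YEAR

print(f"A_i = {a_i:.5f}")
print(f"expected repair downtime = {downtime_hours:.2f} hours per year")
```

The two inputs pull in opposite directions: doubling the MTBF or halving the MTTR both raise the availability, which is why repair speed matters as much as failure frequency.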

The operational availability of a system extends the inherent availability to incorporate scheduled maintenance downtime.

We say that to provide a particular function, we need a particular number of units $N$.

If we need $N$ units, we can add one additional unit to cover failures. The additional unit is denoted $+1$, which gives us the designation $N+1$.

Alternatively, if we need $N$ units, we can duplicate each individual unit, which gives us the designation $2N$.

Note that in practice sometimes the $2N$ and $N+1$ configurations lead to the same number of units, but usually one designation will be preferred over another to convey meaning.
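
To make the two designations concrete, the sketch below counts the units installed under each scheme for a few values of $N$. The function names `units_n_plus_1` and `units_2n` are hypothetical and introduced only for this illustration.

```python
def units_n_plus_1(n: int) -> int:
    """Units installed under N+1: the N required plus one spare."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n + 1


def units_2n(n: int) -> int:
    """Units installed under 2N: every required unit is duplicated."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 * n


for n in (1, 2, 4):
    print(f"N = {n}: N+1 installs {units_n_plus_1(n)}, 2N installs {units_2n(n)}")
```

For $N = 1$ both schemes install two units, which is the case where the designations coincide. For any larger $N$ the $2N$ scheme installs more units than $N+1$.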