Dr Peadar Grant

Data Centre Infrastructure

1 Reliability

Any system, operated long enough, will experience failures. We define the reliability R, as a function of the time t for which the system has been operating since it started, to be:

\[ R(t) = e^{-\lambda t} \tag{1} \]

Equation 1 suggests that the longer a particular system runs, the lower the likelihood that it has not yet experienced a failure. How quickly that likelihood falls is governed by the parameter λ, which we call the failure rate.
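As a rough illustration of how reliability falls over time, the sketch below evaluates the exponential form R(t) = exp(-λt). The function name `reliability` and the sample failure rate of 0.25 per year are assumptions chosen for illustration only.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Probability of no failure up to time t, assuming R(t) = exp(-lambda * t).

    failure_rate and t must use matching units (e.g. per year and years).
    """
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


# Illustrative values only: a failure rate of 0.25 per year.
for years in (0, 1, 4, 8):
    print(f"t = {years} years: R = {reliability(0.25, years):.3f}")
```

Note that R(0) = 1 (a system that has just started has not yet failed) and that R only ever decreases as t grows.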


Figure 1: Reliability as a function of time

1.1 Failure rate

Assuming that a system on average will have N_f failures, observed over a total time of T_p, we define the failure rate to be:

\[ \lambda = \frac{N_f}{T_p} \tag{2} \]

This means that λ has units of inverse time.

Example 1 (Failure rate). A system’s manager recorded 1 failure in a 4 year period. What is the system’s failure rate per year?

\begin{align}
\lambda &= \frac{N_f}{T_p} \tag{3} \\
        &= \frac{1}{4\,\text{year}} \tag{4} \\
        &= 0.25\,\text{year}^{-1} \tag{5}
\end{align}
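The calculation in Example 1 can be sketched in a few lines of Python. The function name `failure_rate` is hypothetical, and the result is expressed in whatever inverse time unit matches the observation period.

```python
def failure_rate(n_failures: int, period: float) -> float:
    """Failure rate = number of failures / observation period.

    The result has units of 1 / (unit of period).
    """
    if period <= 0:
        raise ValueError("observation period must be positive")
    if n_failures < 0:
        raise ValueError("number of failures cannot be negative")
    return n_failures / period


# Example 1: 1 failure observed over 4 years.
rate_per_year = failure_rate(1, 4.0)
print(rate_per_year)  # 0.25 (failures per year)
```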

2 Bathtub curve

In practice, the failure rate is elevated at the beginning and end of a system’s life, following the so-called bathtub curve (Figure 2). The failure rate quoted for a system normally excludes these two periods and can be assumed to be constant for that system.


Figure 2: Bathtub curve

3 Mean time between failures

The Mean Time Between Failures (MTBF) is the reciprocal of the failure rate, and has units of time:

\[ \text{MTBF} = \frac{1}{\lambda} \tag{6} \]

It is generally accepted that the mean time between failures relates only to the middle portion of the so-called “bathtub curve” of reliability (Section 2).

Example 2 (Mean time between failures). A system on average fails on two occasions in five years. Calculate the mean time between failures.

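\begin{align}
\text{MTBF} &= \frac{1}{\lambda} \tag{7} \\
            &= \frac{1}{N_f / T_p} \tag{8} \\
            &= \frac{T_p}{N_f} \tag{9} \\
            &= \frac{5}{2} \tag{10} \\
            &= 2.5\,\text{year} \tag{11}
\end{align}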
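A minimal sketch of Example 2 follows, checking that computing the MTBF directly as the observation period divided by the number of failures agrees with taking the reciprocal of the failure rate. The function name `mtbf` is an assumption made for illustration.

```python
import math


def mtbf(n_failures: int, period: float) -> float:
    """Mean time between failures = observation period / number of failures."""
    if n_failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    if period <= 0:
        raise ValueError("observation period must be positive")
    return period / n_failures


# Example 2: two failures in five years.
direct = mtbf(2, 5.0)

# Same result via the reciprocal of the failure rate (failures per year).
rate = 2 / 5.0
via_rate = 1 / rate

print(direct)  # 2.5 (years)
print(math.isclose(direct, via_rate))  # True
```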

4 Mean time to repair

Assuming that a failure has occurred, it normally requires a repair time, during which the system is unavailable. Averaged over many repairs, we say that a particular type of failure has a Mean Time To Repair (MTTR).

5 Inherent availability

The inherent availability of a system tells us for what percentage of time it is likely to be available. It is based on two ideas:

  1. The system is available between failures, which should occur at intervals of the MTBF.
  2. When a failure occurs, the system will be unavailable for the time taken to repair it, i.e. the MTTR.

So, the availability is essentially determined by how often a repair is needed and how long it takes. Knowing the MTBF and MTTR, we can estimate the inherent availability, A_i, of a system:

\[ A_i = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \tag{12} \]

Example 3 (Inherent availability). A system has an MTBF of 24 days and an MTTR of 12 hours. Calculate the inherent availability. Taking days as the base unit, 12 hours = 0.5 days, so:

\begin{align}
A_i &= \frac{24}{24 + 0.5} \tag{13} \\
    &\approx 98\% \tag{14}
\end{align}
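The main pitfall in Example 3 is mixing units (days and hours). One way to avoid it, sketched below, is to pass both quantities as `datetime.timedelta` values so the conversion happens automatically. The function name `inherent_availability` is an assumption made for illustration.

```python
from datetime import timedelta


def inherent_availability(mtbf: timedelta, mttr: timedelta) -> float:
    """Inherent availability = MTBF / (MTBF + MTTR), as a fraction of 1."""
    up = mtbf.total_seconds()
    down = mttr.total_seconds()
    if up <= 0 or down < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return up / (up + down)


# Example 3: MTBF of 24 days, MTTR of 12 hours.
a_i = inherent_availability(timedelta(days=24), timedelta(hours=12))
print(f"{a_i:.1%}")  # 98.0%
```

Because both arguments are converted to seconds before dividing, the caller can give each one in its most natural unit.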

6 Operational availability

The operational availability of a system extends the inherent availability to incorporate scheduled maintenance downtime.

7 N

We say that to provide a particular function, we need a particular number of units N.

8 Common patterns

8.1 N+1

If we need N units, we add one additional unit to cover failures. The additional unit is denoted by the + 1, giving us the designation N + 1.

8.2 2N

If we need N units, we duplicate each individual unit, giving us the designation 2N.

Note that in practice sometimes the 2N and N + 1 configurations lead to the same number of units, but usually one designation will be preferred over another to convey meaning.
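To make the comparison concrete, the sketch below counts the total units installed under each pattern. The function names `units_n_plus_1` and `units_2n` are hypothetical. It shows that the two designations coincide when N = 1 and diverge for larger N.

```python
def units_n_plus_1(n: int) -> int:
    """Total units installed under N + 1: the N needed plus one spare."""
    if n < 1:
        raise ValueError("need at least one unit")
    return n + 1


def units_2n(n: int) -> int:
    """Total units installed under 2N: every needed unit is duplicated."""
    if n < 1:
        raise ValueError("need at least one unit")
    return 2 * n


for n in (1, 2, 4):
    print(f"N = {n}: N + 1 -> {units_n_plus_1(n)} units, 2N -> {units_2n(n)} units")
```

With N = 1, both patterns install two units, which is the case where the choice of designation conveys intent rather than a different unit count.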