A lot is said about reliability – it’s a common topic of conversation in the pub or over coffee, often as a result of a bad experience (poor reliability). It’s a term that relates very much to quality but is possibly easier to define and quantify. Unfortunately, the numbers that come out of any quantification are rarely much comfort when your widget is one of the small percentage that fail. It’s not even as comforting as seeing somebody else win the lottery jackpot: your numbers had the same chance of winning, but didn’t (at least, with the lottery, you’re in the majority).

I’ve previously spoken about customer satisfaction, where it is dependent on the customer’s perception of what you have supplied (whether a product or a service) being at least up to his expectations. A knowledgeable customer, buying in bulk, will probably expect a percentage of failures – he would like not to have any but he understands that no process is perfect and the closer to perfection you strive, the more the cost is likely to mount up. But the knowledgeable customer will also better know what should be possible.

On simple products, reliability may be a simple binary situation – it works or it’s broken. And that’s often how we think of it but, in quality circles, being aware of the failure potential and its often probabilistic nature means we need not be confined to the binary situation. That’s certainly what has to be done in more complex products and services, where we start to think about series and parallel systems.

Series: where the failure of any part causes failure of the whole. Consider the tyres on a car’s road wheels – one puncture and the car is unusable.

Parallel: where failure of a single part is survivable. Consider the nuts holding on a car’s road wheel – if one comes loose, the car can still be driven (though not with impunity, as the reason for one nut coming loose may also lead to a second or third or fourth coming loose)…

There are some failures that can be accommodated through, say, redundancy and others that are “mission critical”. Needless to say, the latter tend to require the greater attention in operation. When looking at such systems, if we can estimate the probability of each failure we can calculate the overall reliability. Two formulae (and these are very much simplified):

Series: if p1 is the probability of failure of part 1, p2 that of part 2, etc., the reliability of a system with 4 parts is:

(1-p1)x(1-p2)x(1-p3)x(1-p4)

Parallel: if the above parts are working in parallel (such that as long as one remains working the system will remain working), the reliability is:

1-(p1xp2xp3xp4)
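
To put some numbers on the two formulae, here is a minimal Python sketch. It assumes the parts fail independently of one another (which the wheel-nut example suggests is not always true), and the function names and the failure probabilities are invented purely for illustration.

```python
from math import prod


def series_reliability(failure_probs):
    """Series system: works only if every part works."""
    return prod(1 - p for p in failure_probs)


def parallel_reliability(failure_probs):
    """Parallel system: fails only if every part fails."""
    return 1 - prod(failure_probs)


# Hypothetical failure probabilities for four independent parts
probs = [0.01, 0.02, 0.05, 0.10]

print(f"Series:   {series_reliability(probs):.4f}")    # 0.8295
print(f"Parallel: {parallel_reliability(probs):.6f}")  # 0.999999
```

With the same four parts, the series arrangement is less reliable than its weakest part (0.90), while the parallel arrangement is more reliable than its best part (0.99).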

In the real world, it now starts to get very complicated because systems are not either/or, they’re usually a combination (or, in fact, numerous combinations). Neither are the probabilities of failure often known to any great accuracy, so we need experimental data and distribution tables – which moves well beyond a blog piece. For now, just realising what has to be, or can be, considered can help, so I’ll stop here.