Reliability Strategy
Why Fixing Faster Is Not the Same as Fixing Fewer Times
“Every corrective action that does not eliminate the root cause is nothing more than a scheduled re-failure.”
I pulled up the work order history on a horizontal centrifugal pump during a plant assessment some time ago. Four corrective work orders in eleven months. Bearing replacements, all of them. The maintenance crew had gotten fast at it — sixteen-hour average turnaround, down from twenty-two the year before. And the worst part? The plant manager called it a maintenance success story.
I called it a problem that had never been solved.
The bearing specification was correct. The parts were the right parts. The crew knew the steps. But nobody had ever measured the shaft alignment on that pump. Nobody had verified the lubrication interval against the actual operating speed and temperature. Nobody had documented the failure mode in enough detail to notice that each bearing was failing in the same location, in the same way, at intervals that pointed to a mechanical forcing function.
This is the pattern I encounter most often in reliability assessments: not equipment that fails, but equipment that keeps failing, and organizations that have built a capable system for managing that failure without ever asking whether the failure itself should continue to happen.
The Defect Cycle Most Plants Live In
In reliability engineering, the foundational finding from Nowlan and Heap’s 1978 study, conducted for the U.S. Department of Defense using commercial aviation maintenance data, is that 89 percent of the components studied showed failure patterns with no statistical relationship to age. Their failures are not governed by calendar or operating hours in the way that traditional time-based maintenance assumes. They are driven by conditions: installation quality, operating environment, lubrication practice, and the precision (or imprecision) of the last person who worked on that asset.
This matters because it means many plants are executing corrective maintenance against failure modes they have never formally identified, on equipment they have never formally assessed for failure risk. The correction happens. The root cause does not change. And the failure returns.
The underlying pattern is consistent across industries: a small fraction of assets drives a disproportionate share of corrective work. That is not bad luck. That is a defect that has never been eliminated.
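One way to make that concentration visible is to rank assets by corrective work order count. The sketch below is a minimal illustration in Python, not a prescribed method: the record fields (`asset_id`, `type`), the asset tags, and the counts are hypothetical stand-ins for whatever a maintenance system export actually contains.

```python
from collections import Counter

# Hypothetical work order export: asset tags and counts are illustrative only.
work_orders = (
    [{"asset_id": "P-101", "type": "corrective"}] * 4
    + [{"asset_id": "HX-7", "type": "corrective"}] * 3
    + [{"asset_id": "GB-2", "type": "corrective"}]
    + [{"asset_id": "P-102", "type": "corrective"}]
    + [{"asset_id": "F-9", "type": "corrective"}]
    + [{"asset_id": "P-102", "type": "preventive"}]  # ignored by the ranking
)


def top_contributors(orders, share=0.8):
    """Return the highest-count assets that together reach `share` of corrective work orders."""
    counts = Counter(o["asset_id"] for o in orders if o["type"] == "corrective")
    total = sum(counts.values())
    selected, running = [], 0
    # most_common() yields assets from the highest count to the lowest.
    for asset_id, n in counts.most_common():
        if running >= share * total:
            break
        selected.append((asset_id, n))
        running += n
    return selected, total


selected, total = top_contributors(work_orders, share=0.5)
for asset_id, n in selected:
    print(f"{asset_id}: {n} of {total} corrective work orders")
```

In this made-up sample, two of five assets account for seven of the ten corrective work orders. Those two are where root cause work would plausibly start.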
Sociologist Diane Vaughan described this institutional drift as the normalization of deviance — the process by which small deviations from established practice become accepted as normal when they do not immediately cause catastrophic outcomes. A technician who skips the alignment check because “we’ve never had a problem” is doing something that feels reasonable in the moment. Repeated across a crew, across months, it becomes the standard. The failure that follows is not a surprise. It is the predictable output of a practice that drifted from its engineering basis.
The problem is not that equipment fails. The problem is whether the corrective work that follows changes anything about why it failed.
Two Types of Correction
Every plant corrects defects. The question is whether the correction addresses the symptom or the failure mode.
A symptom correction gets the asset back online. It satisfies the work order. It improves mean time to restore. But if the root cause has not been addressed — the wrong lubricant specification, the imprecise clearance on reassembly, the unexamined alignment that generates abnormal bearing load — the asset will return to the work queue on a predictable timeline.
A failure mode correction eliminates the defect. It requires someone to ask why, to document the finding, and to change something — a procedure, a specification, a training requirement, a part standard — to ensure the failure mechanism is removed. That work order does not come back.
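The closure rule this implies can be sketched as a simple data structure. This is an assumption-laden illustration in Python rather than a description of any particular maintenance system: the class name, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CorrectiveWorkOrder:
    """Hypothetical work order record; field names are illustrative."""
    asset_id: str
    failure_mode: str
    restored: bool = False
    root_cause: Optional[str] = None
    # Changes made to remove the failure mechanism (procedure, spec, training, part standard).
    preventive_changes: list[str] = field(default_factory=list)

    def can_close(self) -> bool:
        # Restarting the asset is necessary but not sufficient: a documented
        # root cause and at least one recorded change are also required.
        return self.restored and bool(self.root_cause) and bool(self.preventive_changes)


wo = CorrectiveWorkOrder("P-101", "bearing failure", restored=True)
closable_after_restart = wo.can_close()  # False: only the symptom is addressed

# Hypothetical finding and follow-up action, for illustration only.
wo.root_cause = "shaft misalignment"
wo.preventive_changes.append("alignment check added to reassembly procedure")
closable_after_rca = wo.can_close()  # True: root cause documented and acted on
```

Under this rule a symptom correction leaves the work order open, which keeps the unresolved root cause visible instead of letting it disappear at restart.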
| Symptom Correction | Defect Elimination |
|---|---|
| Fix and return to service | Fix and prevent recurrence |
| Speed of restoration is the success metric | Reduction in repeat failures is the success metric |
| Work order closes when the asset restarts | Work order closes after root cause is documented |
| Root cause assumed or dismissed | Root cause identified and formally acted on |
| Same failure returns in 6–18 months | Failure mechanism is removed from the system |
| Maintenance backlog trends upward over time | Maintenance backlog trends downward over time |
The difference between these two approaches is not primarily a matter of technical skill. It is a matter of organizational expectation. When the dominant measure of maintenance performance is speed of restoration, the team optimizes for speed. When the measure includes recurrence rate and root cause closure, the team optimizes for elimination. Both are rational responses to the metrics and incentives the team is given.
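To make the contrast concrete, both measures can be computed from the same work order history. The following Python sketch rests on stated assumptions: the field names, the sample records, and the definition of a repeat (same asset and same failure mode within a 365-day window) are illustrative choices, not an industry standard.

```python
from datetime import date
from statistics import mean

# Hypothetical corrective work order history; all values are illustrative.
history = [
    {"asset_id": "P-101", "failure_mode": "bearing", "closed": date(2023, 1, 10), "hours_to_restore": 18},
    {"asset_id": "P-101", "failure_mode": "bearing", "closed": date(2023, 4, 2), "hours_to_restore": 16},
    {"asset_id": "HX-7", "failure_mode": "gasket leak", "closed": date(2023, 6, 1), "hours_to_restore": 6},
    {"asset_id": "P-101", "failure_mode": "bearing", "closed": date(2023, 7, 15), "hours_to_restore": 15},
    {"asset_id": "P-101", "failure_mode": "bearing", "closed": date(2023, 11, 20), "hours_to_restore": 15},
]


def restoration_and_recurrence(orders, window_days=365):
    """Return (mean hours to restore, share of work orders that are repeats)."""
    last_seen = {}  # (asset, failure mode) -> date of the previous occurrence
    repeats = 0
    ordered = sorted(orders, key=lambda o: o["closed"])
    for o in ordered:
        key = (o["asset_id"], o["failure_mode"])
        previous = last_seen.get(key)
        # A repeat is the same failure mode on the same asset within the window.
        if previous is not None and (o["closed"] - previous).days <= window_days:
            repeats += 1
        last_seen[key] = o["closed"]
    return mean(o["hours_to_restore"] for o in ordered), repeats / len(ordered)


mttr_hours, repeat_rate = restoration_and_recurrence(history)
print(f"Mean time to restore: {mttr_hours:.1f} h")
print(f"Repeat failure rate: {repeat_rate:.0%}")
```

On this sample the mean time to restore is 14 hours, which looks healthy in isolation, while three of the five work orders are repeats of a failure already corrected once. Reporting the two numbers side by side is what exposes the difference between the columns in the table.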
The Framework We Keep Coming Back To
Turning a plant from a symptom-correction culture to a defect-elimination culture requires more than intent. It requires structure — a repeatable way of identifying defects, diagnosing their sources, executing corrections to a design standard, and closing the loop in a way that actually sticks. The following six practices form the foundation of the approach we return to with every plant we work with. This is not a program with a launch date and an end date. These are disciplines that build on each other and compound over time.
What This Looks Like in Practice
Plants that work through this framework do not see results in a quarter. The pattern we observe consistently is a 12-to-24-month arc before the shift becomes self-sustaining.
The question most reliability programs are designed to answer is: how do we respond to failures better? That is not a bad question. But the highest-value question — the one that changes a plant’s trajectory over a three-to-five-year horizon — is different: are we eliminating the failure modes that generate this work, or are we simply doing the same work faster? Defect correction, executed reactively and without root cause closure, is not a path out of the reliability problem. It is the reliability problem. The plants that break out of the reactive cycle do so by building the discipline to ask, after every significant failure: what has to change so that this is the last time?
Every plant I have walked through has people who know exactly where the problems are. They can name the three pumps that run rough, the heat exchanger that leaks at every shutdown, the gearbox that burns through seals on a schedule. The knowledge exists on every floor of every facility I have visited. What is often missing is the structure to act on it — the time, the documented procedures, the organizational expectation that the job is not finished when the asset restarts, but when the root cause is closed. That shift — from “fix it again” to “fix it for the last time” — is available to every plant willing to build the systems that make it possible.
Yoann Urruty, Eng., CMRP
Yoann is the Montreal Office Manager at Reliability Solutions and brings over 20 years of specialized expertise in industrial reliability and maintenance engineering. He began his career as a Reliability Specialist and Trainer before advancing to Manager of Reliability Engineering, where he spent nearly a decade developing strategic reliability and maintenance optimization programs. As Director of Technologies, Yoann led digital product simulation initiatives and CFD engineering projects. He now leverages his deep technical background at Reliability Solutions, focusing on criticality analysis, operational maintenance optimization, and building comprehensive master data and scheduling systems for industrial organizations.
