The Biggest Reliability Problem Isn’t Equipment Failure — It’s Defect Correction


Why Fixing Faster Is Not the Same as Fixing Fewer Times

“Every corrective action that does not eliminate the root cause is nothing more than a scheduled re-failure.”

I pulled up the work order history on a horizontal centrifugal pump during a plant assessment some time ago. Four corrective work orders in eleven months. Bearing replacements, all of them. The maintenance crew had gotten fast at it — sixteen-hour average turnaround, down from twenty-two the year before. And the worst part? The plant manager called it a maintenance success story.

I called it a problem that had never been solved.

The bearing specification was correct. The parts were the right parts. The crew knew the steps. But nobody had ever measured the shaft alignment on that pump. Nobody had verified the lubrication interval against the actual operating speed and temperature. Nobody had documented the failure mode in enough detail to notice that each bearing was failing in the same location, in the same way, at intervals that pointed to a mechanical forcing function.

This is the pattern I encounter most often in reliability assessments: not equipment that fails, but equipment that keeps failing, and organizations that have built a capable system for managing that failure without ever asking whether the failure itself should continue to happen.

The Defect Cycle Most Plants Live In

In reliability engineering, the foundational finding from Nowlan and Heap’s 1978 study, conducted for the U.S. Department of Defense using commercial aviation maintenance data, is that 89 percent of the components studied showed failure patterns with no statistical relationship to age. Those failures are not governed by calendar or operating hours in the way that traditional time-based maintenance assumes. They are driven by conditions: installation quality, operating environment, lubrication practice, and the precision, or imprecision, of the last person who worked on that asset.

This matters because it means many plants are executing corrective maintenance against failure modes they have never formally identified, on equipment they have never formally assessed for failure risk. The correction happens. The root cause does not change. And the failure returns.

89% of components studied showed failure patterns with no statistical relationship to age (Nowlan & Heap, 1978)

$125K/hr: average cost of unplanned downtime reported by 3,215 plant decision-makers (ABB, 2023)

80 / 5: 80% of corrective labor hours spent on fewer than 5% of assets, year after year

The underlying pattern is consistent across industries: a small fraction of assets drives a disproportionate share of corrective work. That is not bad luck. That is a defect that has never been eliminated.
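One way to check whether a plant shows this concentration is to rank assets by corrective labor hours and measure the share taken by the top few percent. The sketch below is a minimal illustration, assuming work order history can be exported as (asset ID, labor hours) pairs; the function name, asset tags, and sample numbers are hypothetical and do not come from any particular CMMS.

```python
from collections import defaultdict
import math


def labor_concentration(work_orders, asset_count, top_fraction=0.05):
    """Share of corrective labor hours consumed by the top `top_fraction` of assets.

    work_orders: iterable of (asset_id, labor_hours) for corrective work only.
    asset_count: total number of maintainable assets, including assets that
                 have no corrective work orders at all.
    Returns (share, list_of_top_asset_ids).
    """
    hours_by_asset = defaultdict(float)
    for asset_id, labor_hours in work_orders:
        hours_by_asset[asset_id] += labor_hours

    total_hours = sum(hours_by_asset.values())
    if total_hours == 0:
        return 0.0, []

    # Size of the "top" group is based on the whole asset population,
    # not only on the assets that appear in the work order history.
    top_n = max(1, math.ceil(asset_count * top_fraction))
    ranked = sorted(hours_by_asset.items(), key=lambda kv: kv[1], reverse=True)
    top_assets = ranked[:top_n]
    share = sum(hours for _, hours in top_assets) / total_hours
    return share, [asset_id for asset_id, _ in top_assets]


# Hypothetical sample: a 20-asset area where one pump dominates corrective work.
sample = [
    ("P-101", 16.0), ("P-101", 18.0), ("P-101", 15.0), ("P-101", 15.0),
    ("HX-204", 6.0), ("GB-310", 5.0), ("P-102", 5.0),
]
share, worst = labor_concentration(sample, asset_count=20)
print(f"Top assets {worst} account for {share:.0%} of corrective labor hours")
```

If the share stays high for the same assets from one year to the next, that points to defects that are being corrected repeatedly rather than eliminated.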

Sociologist Diane Vaughan described this institutional drift as the normalization of deviance — the process by which small deviations from established practice become accepted as normal when they do not immediately cause catastrophic outcomes. A technician who skips the alignment check because “we’ve never had a problem” is doing something that feels reasonable in the moment. Repeated across a crew, across months, it becomes the standard. The failure that follows is not a surprise. It is the predictable output of a practice that drifted from its engineering basis.

The problem is not that equipment fails. The problem is whether the corrective work that follows changes anything about why it failed.

Two Types of Correction

Every plant corrects defects. The question is whether the correction addresses the symptom or the failure mode.

A symptom correction gets the asset back online. It satisfies the work order. It improves mean time to restore. But if the root cause has not been addressed — the wrong lubricant specification, the imprecise clearance on reassembly, the unexamined alignment that generates abnormal bearing load — the asset will return to the work queue on a predictable timeline.

A failure mode correction eliminates the defect. It requires someone to ask why, to document the finding, and to change something — a procedure, a specification, a training requirement, a part standard — to ensure the failure mechanism is removed. That work order does not come back.

Symptom Correction	Defect Elimination
Fix and return to service	Fix and prevent recurrence
Speed of restoration is the success metric	Reduction in repeat failures is the success metric
Work order closes when the asset restarts	Work order closes after root cause is documented
Root cause assumed or dismissed	Root cause identified and formally acted on
Same failure returns in 6–18 months	Failure mechanism is removed from the system
Maintenance backlog trends upward over time	Maintenance backlog trends downward over time

The difference between these two approaches is not primarily a matter of technical skill. It is a matter of organizational expectation. When the dominant measure of maintenance performance is speed of restoration, the team optimizes for speed. When the measure includes recurrence rate and root cause closure, the team optimizes for elimination. Both are rational responses to the metrics and incentives the team is given.
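The arithmetic behind that distinction can be sketched with the pump from the opening example. The code below is an illustration only: it assumes roughly 8,000 operating hours in an eleven-month period and assumes the previous period also had four failures, neither of which is stated in the work order history described above. The function names are hypothetical.

```python
def mttr_hours(repair_durations):
    """Mean time to restore: average hours per corrective event."""
    if not repair_durations:
        raise ValueError("no corrective events in the period")
    return sum(repair_durations) / len(repair_durations)


def mtbf_hours(period_hours, repair_durations):
    """Mean time between failures: uptime divided by number of failures."""
    if not repair_durations:
        raise ValueError("no failures in the period")
    uptime = period_hours - sum(repair_durations)
    return uptime / len(repair_durations)


PERIOD_HOURS = 8000  # assumption: about eleven months of continuous duty

# Assumed repair histories: same failure count, faster turnaround.
previous_period = [22, 22, 22, 22]
current_period = [16, 16, 16, 16]

mttr_before = mttr_hours(previous_period)
mttr_after = mttr_hours(current_period)
mtbf_before = mtbf_hours(PERIOD_HOURS, previous_period)
mtbf_after = mtbf_hours(PERIOD_HOURS, current_period)

print(f"MTTR: {mttr_before:.0f} h -> {mttr_after:.0f} h")
print(f"MTBF: {mtbf_before:.0f} h -> {mtbf_after:.0f} h")
```

Under these assumptions the restoration metric improves by more than a quarter while the time between failures barely moves, which is why a plant measured only on the first number can report success on an asset that is not getting more reliable.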

The Framework We Keep Coming Back To

Turning a plant from a symptom-correction culture to a defect-elimination culture requires more than intent. It requires structure — a repeatable way of identifying defects, diagnosing their sources, executing corrections to a design standard, and closing the loop in a way that actually sticks. The following six practices form the foundation of the approach we return to with every plant we work with. This is not a program with a launch date and an end date. These are disciplines that build on each other and compound over time.

Defect Visibility

Before you can eliminate defects, you need to see them. Most plants carry invisible backlog — equipment conditions that operators have noticed, supervisors have worked around, and nobody has formally captured. A defect register, even a basic one, changes this dynamic. It creates a shared, prioritized view of the gap between current equipment condition and design standard. It gives operations a formal channel to surface what they know. And it gives reliability and maintenance teams the data to direct corrective effort before the next failure, not after. Defects that are visible can be prioritized. Defects that live only in informal conversation cannot.
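A basic defect register does not need much structure to be useful. The sketch below shows one possible shape, assuming a simple 1-to-5 severity scale and an open/closed status; the field names, the scale, and the sample entries are illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Defect:
    asset_id: str
    description: str
    reported_by: str      # operator, technician, or inspector who raised it
    reported_on: date
    severity: int         # assumed scale: 1 (minor) to 5 (failure risk is imminent)
    status: str = "open"  # assumed states: "open" or "closed"


def prioritized_open_defects(register):
    """Open defects ordered by severity (highest first), then by age (oldest first)."""
    open_items = [d for d in register if d.status == "open"]
    return sorted(open_items, key=lambda d: (-d.severity, d.reported_on))


# Hypothetical entries of the kind operators often know about informally.
register = [
    Defect("P-101", "Runs rough, audible bearing noise", "operator", date(2024, 3, 4), 4),
    Defect("HX-204", "Weeps at flange after every shutdown", "operator", date(2024, 1, 15), 3),
    Defect("GB-310", "Seal leak returned after last repair", "technician", date(2024, 2, 20), 4),
    Defect("P-102", "Guard bolt missing", "inspector", date(2024, 3, 1), 2, status="closed"),
]

for defect in prioritized_open_defects(register):
    print(defect.severity, defect.asset_id, defect.description)
```

The ordering rule is a design choice: sorting by severity and then by age keeps long-standing defects from sinking to the bottom of the list simply because newer items keep arriving.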

Root Cause Discipline

Not every defect requires a formal root cause analysis. But the ones that repeat do, without exception. Root cause discipline means building the organizational habit of asking why as a standard part of corrective work, not as an exceptional investigation reserved for catastrophic events. In practice, this means capturing failure mode data in the computerized maintenance management system (CMMS) and asset performance management (APM) software, reviewing repeat failures on a structured cadence, and consistently distinguishing between the symptom and the mechanism. A bearing that fails repeatedly is a symptom. The mechanism might be shaft misalignment, an incorrect lubrication specification, or contamination ingress at the seal. The mechanism is what you eliminate. Treating the symptom simply reschedules the next failure.
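A structured review of repeat failures can start from a simple query over work order history. The following is a minimal sketch, assuming each closed corrective work order carries an asset ID, a failure mode code, and a close date; the threshold of two events in 365 days, the function name, and the sample records are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def find_repeat_failures(work_orders, min_count=2, window_days=365):
    """Flag (asset_id, failure_mode) pairs that recur within a rolling window.

    work_orders: iterable of (asset_id, failure_mode, closed_on) tuples.
    Returns {(asset_id, failure_mode): max events seen in any window}.
    """
    dates_by_key = defaultdict(list)
    for asset_id, failure_mode, closed_on in work_orders:
        dates_by_key[(asset_id, failure_mode)].append(closed_on)

    window = timedelta(days=window_days)
    repeats = {}
    for key, dates in dates_by_key.items():
        dates.sort()
        best = 0
        start = 0
        # Sliding window over sorted dates: count events within `window_days`.
        for end in range(len(dates)):
            while dates[end] - dates[start] > window:
                start += 1
            best = max(best, end - start + 1)
        if best >= min_count:
            repeats[key] = best
    return repeats


# Hypothetical history resembling the pump described at the start of the article.
history = [
    ("P-101", "bearing failure", date(2023, 1, 10)),
    ("P-101", "bearing failure", date(2023, 4, 2)),
    ("P-101", "bearing failure", date(2023, 7, 15)),
    ("P-101", "bearing failure", date(2023, 11, 20)),
    ("HX-204", "gasket leak", date(2023, 6, 1)),
]
repeats = find_repeat_failures(history)
print(repeats)
```

A query like this only finds the symptom that repeats. Identifying the mechanism behind it still takes an investigation, but the list tells the team where that investigation is owed.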

Precision Execution

How corrective work is performed matters as much as the decision to perform it. A bearing installed with the wrong interference fit, on a shaft with an unverified alignment, with a lubricant that does not match the operating conditions, will fail on a predictable schedule — not because the technician was careless, but because the execution did not meet the engineering standard required for that asset to perform. Precision maintenance is not about perfectionism. It is about executing work to a documented, design-based standard: written procedures, calibrated tools, and technicians who know the tolerance ranges that separate spec-compliant from close enough. Most technicians want to work to a standard. What is often missing is the standard itself, or the scheduled time and proper equipment to meet it.

Workforce Capability

Knowledge and skill are not the same thing. A technician can understand that shaft alignment matters and still lack the hands-on repetition to execute a laser alignment correctly under field conditions — on a schedule, on an asset that is behind on production, with a coupling still warm from the previous shift. Capability develops through deliberate practice, structured feedback, and verified competency, not through a training session followed by immediate deployment. Skill gaps are not a character problem; they are a structural one. Plants that close those gaps deliberately — through mentored practice, field verification of technique, and honest assessment of where each person actually stands against a documented standard — are the ones whose defect rates genuinely move.

Cross-Functional Ownership

Defects enter a plant through multiple functions. Operations may run equipment outside design parameters. Maintenance may execute work imprecisely. Engineering may specify components that do not match the operating context. Procurement may substitute parts without verifying design equivalence. No single function can address this alone. Durable defect elimination requires operations, maintenance, reliability engineering, and plant leadership to share a common set of metrics around equipment condition — not just production output and corrective work count. When reliability becomes a shared business responsibility rather than a maintenance department metric, the conversations that actually prevent failures become possible.

Learning Systems

Every failure is data. Every corrective action is an observation. When work history is captured with sufficient detail — failure mode, probable cause, condition at time of repair — patterns become visible across assets and over time. When it is not — when work orders close with “replaced component, returned to service” — the organization learns nothing, and the next identical failure starts from zero. CMMS discipline is not administrative overhead. It is the mechanism by which individual experience becomes institutional knowledge, and institutional knowledge becomes fewer failures. The loop between corrective action and documented root cause closure is exactly where most reliability improvements either compound or collapse.
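One lightweight way to enforce that level of detail is to check close-out records for the fields that make later pattern analysis possible. The sketch below assumes a work order is available as a dictionary and uses three hypothetical field names based on the details listed above; a real CMMS will have its own schema and its own way of enforcing required fields.

```python
# Hypothetical field names for the three details named above.
REQUIRED_FIELDS = ("failure_mode", "probable_cause", "condition_found")


def missing_closeout_fields(work_order):
    """Return the required close-out fields that are absent or blank."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(work_order.get(field) or "").strip()
    ]


# A close-out that teaches the organization nothing.
thin_record = {
    "asset_id": "P-101",
    "notes": "replaced component, returned to service",
}

# A close-out that supports later pattern analysis (illustrative values).
useful_record = {
    "asset_id": "P-101",
    "failure_mode": "bearing failure",
    "probable_cause": "shaft misalignment",
    "condition_found": "damage on the same side of the bearing as the last repair",
}

thin_missing = missing_closeout_fields(thin_record)
useful_missing = missing_closeout_fields(useful_record)
print(thin_missing)
print(useful_missing)
```

A check like this does not guarantee that the entries are accurate, only that they exist. It is a starting point for the review cadence, not a replacement for it.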

What This Looks Like in Practice

Plants that work through this framework do not see results in a quarter. The pattern we observe consistently is a 12-to-24-month arc before the shift becomes self-sustaining.

Months 1–6

The uncomfortable truth

Defect registers surface how much backlog has been invisible. Root cause investigations take time that reactive schedules cannot easily give. Precision standards feel demanding against a history of getting by. CMMS data looks bad because it is finally honest.

Month 12

The shift begins

The small group of assets that consumed most corrective labor hours starts to shrink as the highest-frequency failure modes are addressed one by one. Planners have more wrench time to allocate. Fewer emergency calls pull crews off scheduled work. Operators begin reporting equipment conditions earlier, because the defect register has shown them that their observations matter and lead to action.

Months 18–24

Self-sustaining

The maintenance backlog has a different composition — more proactive work, less reactive work. The cost of running assets begins to fall, not because maintenance spending was cut, but because fewer corrections are needed. Reliability metrics — MTBF, OEE, unplanned downtime rate — begin to reflect a plant that is genuinely improving, not just responding faster.

The question most reliability programs are designed to answer is: how do we respond to failures better? That is not a bad question. But the highest-value question — the one that changes a plant’s trajectory over a three-to-five-year horizon — is different: are we eliminating the failure modes that generate this work, or are we simply doing the same work faster? Defect correction, executed reactively and without root cause closure, is not a path out of the reliability problem. It is the reliability problem. The plants that break out of the reactive cycle do so by building the discipline to ask, after every significant failure: what has to change so that this is the last time?

Every plant I have walked through has people who know exactly where the problems are. They can name the three pumps that run rough, the heat exchanger that leaks at every shutdown, the gearbox that burns through seals on a schedule. The knowledge exists on every floor of every facility I have visited. What is often missing is the structure to act on it — the time, the documented procedures, the organizational expectation that the job is not finished when the asset restarts, but when the root cause is closed. That shift — from “fix it again” to “fix it for the last time” — is available to every plant willing to build the systems that make it possible.

Yoann Urruty, Eng., CMRP

Yoann is the Montreal Office Manager at Reliability Solutions and brings over 20 years of specialized expertise in industrial reliability and maintenance engineering. He began his career as a Reliability Specialist and Trainer before advancing to Manager of Reliability Engineering, where he spent nearly a decade developing strategic reliability and maintenance optimization programs. As Director of Technologies, Yoann led digital product simulation initiatives and CFD engineering projects. He now leverages his deep technical background at Reliability Solutions, focusing on criticality analysis, operational maintenance optimization, and building comprehensive master data and scheduling systems for industrial organizations.