Let’s Be Frank: Failed Failure Analysis

By Inspector Frank. April 29, 2021

Editor’s Note:  Writing under the pseudonym Inspector Frank, the author of this column offers a first-hand, candid view of what he has witnessed throughout his career. His purpose in sharing these experiences and opinions is to encourage readers to think deeper about what they do, why they do it, and the possible impact of their decisions.

Inspectioneering is committed to protecting the anonymity of pseudonymous authors. We do, however, hold these contributors to the same editorial standards as those writing under their own name. In this, we know the author’s identity and maintain communications regarding the author’s published works. If you have any questions, feedback, or concerns stemming from this article, please send an email to and we will forward your correspondence to the appropriate party.

In the first article I wrote for Inspectioneering back in June of 2019 entitled “Let's Be Frank: What Went Wrong?” I talked about how having a significant incident at your facility can really help hone an integrity group. One of the elements I mentioned briefly in that article was the investigation into what went wrong.

When an incident (large or small) occurs, most companies will perform some kind of root cause analysis (RCA). I specifically want to talk about the investigations around a loss of containment (LOC) incident, as opposed to say a personnel safety incident or a process control incident.

In other words, what is the follow-up to an event where the process no longer stayed in the pipes?

For those of you not as familiar with the process of RCA, it is basically a systematic process for identifying “root causes” of events. In its most distilled form, it is a way to determine which factors are root causes and not just symptoms. RCA is meant to drill down to help establish what the underlying problems are that led to the incident. The term denotes the earliest, most basic cause for a given problem; in this case a loss of containment. The idea is that you can only see an error by its manifest signs (or symptoms).

That information should then be used to look at ways to manage your systems to eliminate or safeguard against those root causes, most commonly called corrective actions.

In its most basic form, an LOC RCA has the following steps:

  • Step 1. Clearly identify the LOC being analyzed.
  • Step 2. Gather data around the LOC, including generating a timeline of events, understanding failure mechanics/corrosion mechanisms, pulling out process operations history, etc.
  • Step 3. Analyze the data and timeline information, with the idea being to drill down through the data and timeline to establish the ‘roots’ of the LOC.
  • Step 4. Clearly state what the root cause(s) are.
  • Step 5. Develop mitigation strategies (corrective actions) to correct the root cause(s).

RCA around an LOC is basically an events and causal factor analysis: this process uses evidence gathered quickly and methodically to establish a timeline for the activities leading up to the incident. This can take many forms and there are many tools out there that can be used to perform the analysis portion. I am sure most of you have worked with one of them at one time or another.

Some of the more common methods of performing RCAs will utilize one or more of the following tools: 5 Whys, Fishbone Diagram (also known as an Ishikawa Diagram), Why Tree Diagram, etc.

Over the past 20 years or so I have watched many companies get significantly better at performing root cause analyses and using the information gathered to make appropriate corrective actions. However, I still see many companies struggling with identifying the correct root causes and developing appropriate corrective actions around LOCs that involve performing a metallurgical failure analysis (establishing the failure mechanism – be it mechanical or corrosion driven). Many facilities tend to want to take the quickest repair route, which can lead to key evidence being destroyed.

What do I mean? Let’s look at a scenario (maybe theoretical, maybe not) involving a loss of containment and the RCA that followed it, which can help illustrate what I am referring to.

The refinery just had a loss of containment on some rich Diglycolamine (DGA) piping running from the absorber back to the amine gas treating unit (rich DGA refers to DGA that has already absorbed, and is therefore ‘rich’ in, hydrogen sulphide and carbon dioxide, as opposed to lean DGA which has not been used in the absorption process yet).

The unit inspector was out as soon as the line was purged, depressurized and isolated. The first reports and images show that the loss of containment is at a weld, and it looks like corrosion is preferentially attacking the weld metal (or potentially the fusion line). The amine gas treating system is now shut down and some of the main processing units are now in reduced or idling operation. Everyone from operations to maintenance to the admin in the front office knows the first priority is to get the plant running again quickly, but safely.

Some of you fine readers may already see where this is going...

A team consisting of some in-house SMEs is formed to get to the bottom of this. Operations starts pulling process history off of the DCS, maintenance starts pulling all the packages for any work they have done on this piping circuit in the last ten years and the EI group starts pulling historic corrosion monitoring and inspection data.

The unit inspector sets up radiographic testing (RT) of the failed weld and decides to also shoot four welds upstream and four welds downstream of the failed location at the same time. He also starts getting some ultrasonic thickness testing (UT) of the pipes and elbows in the circuit, focusing on areas where there may be impingement or flow changes.

As the UT data starts coming back in, it corresponds to the historic trends that have been seen on this line for the past 30 years; nothing abnormal is found on the pipe or elbow segments. The RT does show that the weld seems to have been preferentially corroded, and there is less significant degradation found on some roots of other welds in the circuit. Why are welds being preferentially attacked?

The corrosion monitoring program (UT and RT) data and historic API 570 visual reports are reviewed and show no historic issues of note.

As the EI group ponders what further evidence and data they need, the operations manager comes by and says they need to get this circuit back up and running quickly as this is severely reducing overall plant throughput.

The unit inspector sets up some X-ray fluorescence positive material identification (XRF PMI) of the welds to see if the failed weld or the others showing minor corrosion have any chemical composition differences that may be the cause of the preferential weld corrosion. Some minor chemistry differences are noted in some welds, including the one that failed, but no ‘smoking gun’ is found.

When investigating the work history, Maintenance found that the line was rerouted nine years ago. Unfortunately, the package was not properly completed, nor were the isometric drawings redlined, so they are not quite sure which welds were made at that time as opposed to during the original line installation 30 years prior. Process said they have had no operational upsets or changes to operation in the last five years (the period for which DCS information is easily available). Rich DGA lab sampling location history shows no changes in stream composition over the last 15 years.

The EI group asks for the failed weld to be cut out for further metallurgical failure analysis; specifically, that a section of pipe 2’ on either side of the weld be cut out. Engineering counters that, since only the welds are showing corrosion and the associated piping is still sound, it would be quicker to gouge out the failed weld and reweld it than to cut out and replace a whole section of pipe.

Speed of repair consideration wins and the evidence is arc gouged into oblivion. The line is hydrotested, passes, and operations start the process of reinstating the line and getting the facility back up and running.

Everyone is asked to gather and compile their data and prepare for an RCA that will start the following day. After going through a ‘why tree’ analysis no firm root cause is established. It is suspected that weld metal chemistry variations were the cause for the preferential weld corrosion attack. As such, the EI group will start monitoring welds on this line with RT, with the first set of shots being planned for six months from now.

Because improper weld metal is considered to be the main root cause, quality control (QC) systems around construction and repairs are to be reviewed and weld filler metal control is to be tightened up. As QC systems had been improving at the plant over the last five years, it was found this had already been accomplished.

The plant manager congratulated everyone for a fast response to the incident and a good investigation. However, two months later another weld failed on this line.

This time EI convinced everyone of the benefit of not destroying evidence by just gouging out the problem welds and got the failed weld sections of pipe cut out, as well as a few of the other welds that were showing preferential attack.

When these failure samples were sectioned it was found the pipe was being corroded only on the downstream side of the weld. It visually appeared that erosion/corrosion was happening downstream of the flow restriction that was created by the root of the weld protruding into the pipe as opposed to preferential corrosion attack of the weld metal itself. This made the original RCA ineffective because incorrect data was used. A new RCA was performed. With this new evidence of the damage mechanism being accounted for, very different root causes were found.

It was established that the line was rerouted nine years earlier as part of a capital project to increase throughput through the amine absorption units. This was done as part of an overall ‘debottlenecking’ project that was meant to increase crude throughput in the refinery. The pipe was rerouted and over the following three years other changes were made in the process units and the amine treating unit, including upsizing the absorber and DGA stripper vessels, to allow for a greater volume of hydrogen sulphide removal to occur, thereby increasing crude throughput. This operational change went active six years prior to the failure.

The greater volume of DGA going through the system increased fluid velocity in the rich DGA piping as the pipes were not upsized as part of the debottlenecking.

From API 571 (2020), Section 3.2.3 h): Process stream velocity will influence the amine corrosion rate and nature of attack. Corrosion is generally uniform; however, high velocities and turbulence will cause localized thickness losses. For carbon steel, velocities are generally limited to 3 fps to 6 fps (1 m/s to 2 m/s) for rich amine and about 20 fps (6 m/s) for lean amine.

By increasing throughput, velocity in the rich DGA lines had been increased to approximately 10 fps. At this increased velocity the protruding roots of the welds on the interior of the pipe caused the rich DGA to eddy and swirl on the downstream side of the weld, causing preferential erosion/corrosion in that location only. None of the changes in direction of the line or any other factors caused turbulence that was causing other erosion/corrosion issues. It was only occurring where there was excessive root penetration in welds.
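To put rough numbers on this, here is a minimal sketch of the velocity check, assuming bulk velocity is simply volumetric flow divided by pipe cross-sectional area. The inside diameter and the two flow rates are hypothetical values, picked only so the results land near the 4 fps design figure and the 10 fps seen after debottlenecking; the 3 fps to 6 fps range is the one quoted from API 571 above.

```python
import math

GPM_PER_CFS = 448.831  # US gallons per minute in one cubic foot per second

# Range quoted from API 571 above for rich amine in carbon steel, in ft/s.
RICH_AMINE_RANGE_FPS = (3.0, 6.0)


def velocity_fps(flow_gpm: float, inside_diameter_in: float) -> float:
    """Bulk velocity (ft/s) = volumetric flow / pipe cross-sectional area."""
    flow_cfs = flow_gpm / GPM_PER_CFS
    area_ft2 = math.pi * (inside_diameter_in / 12.0) ** 2 / 4.0
    return flow_cfs / area_ft2


# Hypothetical values, not taken from the incident: a 6.065 in. inside
# diameter and flows chosen to land near 4 fps and 10 fps.
PIPE_ID_IN = 6.065
before = velocity_fps(360.0, PIPE_ID_IN)
after = velocity_fps(900.0, PIPE_ID_IN)

for label, v in (("before debottlenecking", before),
                 ("after debottlenecking", after)):
    upper = RICH_AMINE_RANGE_FPS[1]
    status = "at or below" if v <= upper else "above"
    print(f"{label}: {v:.1f} fps ({status} the quoted {upper:.0f} fps upper limit)")
```

The point of the sketch is that with the pipe left unchanged, velocity scales directly with flow: pushing 2.5 times the volume through the same line gives 2.5 times the velocity.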

What was the new root cause, you ask? Poor management of change (MOC) practices nine years earlier. When these equipment changes were made, the facility’s MOC processes were very weak and no inspection, metallurgical or corrosion SMEs were part of that process.
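For readers who have not worked with a why tree, the drill-down from the second RCA can be sketched as a small tree whose branches are followed from the top event down to their ends. This is only an illustration: the `WhyNode` class and `why_paths` function are hypothetical names, the node wording is paraphrased from the scenario, and treating the protruding weld root as its own branch is my assumption about how a team might lay it out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WhyNode:
    """One statement in a why tree, with the causes found by asking 'why?'."""
    statement: str
    causes: List["WhyNode"] = field(default_factory=list)


def why_paths(node: WhyNode) -> List[List[str]]:
    """Return every chain of statements from this node down to a leaf."""
    if not node.causes:
        return [[node.statement]]
    paths = []
    for cause in node.causes:
        for tail in why_paths(cause):
            paths.append([node.statement] + tail)
    return paths


# Node text paraphrased from the scenario; the tree layout is illustrative.
tree = WhyNode("Loss of containment at a weld in rich DGA piping", [
    WhyNode("Erosion/corrosion of the pipe on the downstream side of the weld", [
        WhyNode("Weld root protrudes into the pipe and disturbs the flow"),
        WhyNode("Rich DGA velocity of about 10 fps", [
            WhyNode("Throughput increased but pipes were not upsized", [
                WhyNode("MOC process had no inspection, metallurgical or corrosion SME review"),
            ]),
        ]),
    ]),
])

for path in why_paths(tree):
    print(" -> why? ".join(path))
```

Each printed chain ends at a candidate the team still has to weigh. In the scenario, the branch ending at the MOC process was the one judged to be the root cause, and that branch could only be drawn once the sectioned samples showed the real damage mechanism.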

As a result of this new RCA, the following corrective and preventative actions were taken:

  1. The company's MOC process was revised to ensure all changes were reviewed by the equipment integrity group and allowed for engaging appropriate third-party SMEs if in-house personnel needed assistance.
  2. An engineering project was initiated to upsize the pipe diameter to reduce rich DGA flow velocities to no greater than 4 fps (approximately what the original line had been designed for).
  3. Until that project was complete, the EI group set up RT monitoring to keep an eye on the welds, focusing on the downstream side.
  4. A review of velocities in all other rich DGA circuits in the facility was started. This led to some further changes in inspection monitoring.
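Actions 2 and 4 both come down to the same arithmetic, which can be sketched as below. This is a rough screening sketch only, assuming bulk velocity equals flow divided by area; the circuit names, flows and inside diameters are hypothetical, and the 4 fps target is the figure from action 2.

```python
import math

GPM_PER_CFS = 448.831  # US gallons per minute in one cubic foot per second
TARGET_FPS = 4.0       # target velocity from corrective action 2


def velocity_fps(flow_gpm: float, inside_diameter_in: float) -> float:
    """Bulk velocity (ft/s) for a given flow (US gpm) and pipe ID (in.)."""
    area_ft2 = math.pi * (inside_diameter_in / 12.0) ** 2 / 4.0
    return (flow_gpm / GPM_PER_CFS) / area_ft2


def min_inside_diameter_in(flow_gpm: float, target_fps: float) -> float:
    """Smallest inside diameter (in.) that keeps velocity at or below target."""
    area_ft2 = (flow_gpm / GPM_PER_CFS) / target_fps
    return 12.0 * math.sqrt(4.0 * area_ft2 / math.pi)


# Hypothetical circuits for illustration: (name, flow in gpm, ID in inches).
circuits = [
    ("rich-DGA-A", 900.0, 6.065),
    ("rich-DGA-B", 300.0, 6.065),
    ("rich-DGA-C", 500.0, 4.026),
]

flagged = []
for name, flow_gpm, id_in in circuits:
    v = velocity_fps(flow_gpm, id_in)
    if v > TARGET_FPS:
        flagged.append(name)
        needed = min_inside_diameter_in(flow_gpm, TARGET_FPS)
        print(f"{name}: {v:.1f} fps, needs at least {needed:.2f} in. ID")
    else:
        print(f"{name}: {v:.1f} fps, at or below target")
```

Because area grows with the square of diameter, going from roughly 10 fps to 4 fps at constant flow takes an inside diameter about 1.6 times larger (the square root of 2.5), not 2.5 times larger. A real review would use actual process flows and measured inside diameters, and would round up to the next available pipe size.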

Like many processes, RCA is only as good as the information being used. If critical data about the failure mechanism is being lost during repairs, then this information will be unknown to the group trying to determine the root cause(s). This is one place where I still see many organizations doing poorly, as time and reactive management come together in a way that can destroy crucial evidence.

Performing RCA is important. Finding and correcting those root causes is a form of proactive management, as opposed to reactive management. However, as I outlined above, sometimes reactive management still rears its ugly head and has an effect on the evidence that could be gathered by proper metallurgical failure analysis.

I would recommend each of you take a look at how your organization deals with and investigates losses of containment. Could the scenario above happen to you?


