This blog highlights a unique IBM i feature, First Failure Data Capture, and how IBM uses it to debug intermittent errors on POWER servers. Chances are you have never heard of this capability. As far as I know, nothing like it exists on x86 or other non-IBM platforms.
Disclaimer: iSeries and AS400 are servers. IBM i is an operating system (an excellent one at that). I use these terms interchangeably to help folks find this kind of information on the web.
A question IBM POWER users rarely ask—unless they work on other platforms—is what is an intermittent error?
It’s a kind of bug that occurs without rhyme or reason. Your system may lock up, or worse, it may come to a complete halt. Windows users are very familiar with it—as well as users of non-IBM hardware—and it has been a major nuisance for years. But when it happens to business-class systems it can be much more than a nuisance. It can be very costly.
Back in June 1999, eBay crashed for several hours. The culprit turned out to be electromagnetic radiation from sunspots, which “killed” the cache memory in eBay’s Sun servers. Sun was aware of the faulty hardware before the incident but chose to respond on a “break-fix” basis, so it did not get around to fixing eBay’s servers until after they crashed. Oops.
I cannot remember the last time this happened—if ever—on a POWER server.
The reason IBM users have not suffered from intermittent errors is a feature called First Failure Data Capture (FFDC), a capability built into all IBM POWER servers.
Many do not remember, but in the early 1980s, IBM had intermittent problems pop up in its mainframes as it transitioned from the older circuits and memory of the 360/370 to the semiconductor components of the 43XX. To determine what was going on, the IBM scientists developed a comprehensive way to track every bit as it moved from register to register. What is really elegant is that IBM logs these bugs as they occur. That’s pretty impressive when you consider that the bits fly by at well in excess of 4 GB per second. Without this real-time logging technique, it would be impossible to isolate and re-create an “incident,” because the system crashes before you can blink.
This architecture, dubbed First Failure Data Capture, was developed specifically to “trap” intermittent bugs, enabling IBM scientists to reproduce them and to redesign their circuitry to eliminate them. Nothing less than a bit-by-bit audit trail would have allowed them to accomplish this.
This design proved to be so successful—and so critical to data integrity—that IBM adopted this architecture throughout its entire POWER family of servers.
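To make the idea concrete, here is a minimal sketch in Python of what "capture the evidence at the first failure" means. This is only an illustration, not IBM's actual hardware design: the names Transfer and FirstFailureRecorder, the simple parity check, and the size of the rolling history are all assumptions I made up for the example.

```python
from collections import deque
from typing import NamedTuple, Optional


def parity(value: int) -> int:
    """Return 1 if the number of set bits in value is odd, else 0."""
    return bin(value).count("1") % 2


class Transfer(NamedTuple):
    """One hypothetical register-to-register move plus the parity bit sent with it."""
    step: int
    source: str
    dest: str
    value: int
    parity_bit: int


class FirstFailureRecorder:
    """Keeps a short rolling history and freezes a snapshot at the FIRST bad transfer."""

    def __init__(self, history_size: int = 4) -> None:
        self.history = deque(maxlen=history_size)  # most recent transfers only
        self.captured: Optional[dict] = None       # filled in once, never overwritten

    def observe(self, transfer: Transfer) -> bool:
        """Check one transfer; return True if its parity is consistent."""
        self.history.append(transfer)
        ok = parity(transfer.value) == transfer.parity_bit
        if not ok and self.captured is None:
            # Snapshot the evidence now, so the fault never has to be re-created.
            self.captured = {
                "failing_transfer": transfer,
                "recent_history": list(self.history),
            }
        return ok


def run_demo() -> Optional[dict]:
    """Push five values through the recorder, flipping one bit in flight at step 3."""
    recorder = FirstFailureRecorder(history_size=3)
    values = [0b1010, 0b0111, 0b1100, 0b0001, 0b1111]
    for step, value in enumerate(values):
        sent_parity = parity(value)
        # Simulate an intermittent fault: a single bit flips during step 3.
        received = value ^ 0b0100 if step == 3 else value
        recorder.observe(
            Transfer(step, f"R{step}", f"R{step + 1}", received, sent_parity)
        )
    return recorder.captured


if __name__ == "__main__":
    snapshot = run_demo()
    print("First failure at step:", snapshot["failing_transfer"].step)
    print("History steps kept:", [t.step for t in snapshot["recent_history"]])
```

The point of the sketch is the design choice, not the parity check itself: the checking runs all the time, and the snapshot is taken the instant the first error shows up, so nobody has to wait for the bug to happen a second time.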
Many people are unaware that IBM built the motherboards for Sun and HP, as well as several other computer makers. So IBM got to see what its competitors were planning in future releases. One IBM scientist, speaking in 2008, said that as far as he could tell, First Failure Data Capture would remain an IBM-exclusive feature for the next 7 years or more, because nothing like it had been added to the HP or Sun motherboard designs planned for the foreseeable future. That’s because it takes at least that long to incorporate this capability into an existing architecture for future release. Said differently, if Sun, HP, or anyone else ever plans to offer something like First Failure Data Capture, it will take them 7–10 years before they would be in a position to give IBM a run for its money.
If First Failure Data Capture is such an important feature, why have you probably never heard of it before? Any parent can guess why: the “bad” child gets most of the attention.
Need help with your IBM i?
Email me at blosey@source-data.com or call me at 714-593-0387.