A question IBM POWER users rarely ask, unless they also work on other platforms, is: what is an intermittent error?
It’s a kind of bug that occurs without rhyme or reason. Your system may lock up, or worse, it may come to a complete halt. Windows users are very familiar with it—as well as users of non-IBM hardware—and it has been a major nuisance for years. But when it happens to mainframe-class systems it can be much more than a nuisance. It can be very costly.
Several years ago, eBay crashed for several hours. The culprit turned out to be that electromagnetic radiation from sunspots had “killed” the cache memory in eBay’s Sun servers. Sun was aware of the faulty hardware before the incident but chose to respond on a “break-fix” basis…so they did not get around to fixing eBay’s Sun servers until after they crashed. Oops.
I cannot remember the last time this happened, if ever, on a POWER server.
IBM users have been spared intermittent errors because of a feature called First Failure Data Capture (FFDC), a capability built into all IBM POWER servers.
Many do not remember, but in the early 1980s, IBM had intermittent problems pop up in its mainframes as it transitioned from the older circuits and memory of the 360/370 to the semiconductor components of the 43XX. To determine what was going on, IBM scientists developed a comprehensive way to track every bit as it moved from register to register. What is really elegant is that IBM logs these bugs as they occur. That’s pretty impressive when you consider that the bits fly by at more than 4 GB per second. Without this real-time logging technique, it would have been impossible to isolate and re-create an “incident,” because the system crashes before you can blink.
This architecture, dubbed First Failure Data Capture, was developed specifically to “trap” intermittent bugs, enabling IBM scientists to reproduce them and to redesign their circuitry to eliminate them. Nothing less than a bit-by-bit audit trail would have allowed them to accomplish this.
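To make the idea concrete, here is a minimal Python sketch of "capture at first failure": every simulated register-to-register move is checked against a parity bit, and the context of any mismatch is logged the moment it is detected. This is only an illustration of the principle described above, not IBM's implementation. The names `parity`, `FailureRecord`, `CaptureLog`, and `checked_move` are all hypothetical, and real hardware does this in dedicated checker circuits rather than in software.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def parity(value: int) -> int:
    """Return 1 if the number of set bits in value is odd, else 0."""
    return bin(value).count("1") % 2


@dataclass
class FailureRecord:
    """Context captured at the moment a check fails (hypothetical layout)."""
    step: int
    source: str
    dest: str
    value: int
    stored_parity: int


@dataclass
class CaptureLog:
    """Collects failure records as they happen, in order of detection."""
    records: List[FailureRecord] = field(default_factory=list)

    def checked_move(self, step: int, source: str, dest: str,
                     value: int, stored_parity: int) -> bool:
        """Check one register-to-register move; log it if parity mismatches."""
        if parity(value) != stored_parity:
            self.records.append(
                FailureRecord(step, source, dest, value, stored_parity))
            return False
        return True

    @property
    def first_failure(self) -> Optional[FailureRecord]:
        """The earliest recorded failure, or None if every move passed."""
        return self.records[0] if self.records else None


if __name__ == "__main__":
    log = CaptureLog()
    # Assumed scenario: 0b1011 leaves r1 with parity 1, then one bit flips
    # in transit from r2 to r3, so the value no longer matches its parity.
    transfers = [
        ("r1", "r2", 0b1011, 1),  # intact
        ("r2", "r3", 0b1010, 1),  # corrupted: one bit flipped
        ("r3", "r4", 0b1010, 1),  # corruption carried along
    ]
    for step, (src, dst, value, p) in enumerate(transfers):
        log.checked_move(step, src, dst, value, p)
    print(log.first_failure)
```

The point of the design is that the record is written where the fault is first observed (here, the move from `r2` to `r3`), so there is no need to reproduce the crash later to find out where things went wrong.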
This design proved to be so successful—and so critical to data integrity—that IBM adopted this architecture throughout its entire POWER family of servers.
Many people are unaware that IBM built the motherboards for Sun and HP, as well as several other computer makers. So IBM got to see what its competitors were planning for future releases. One IBM scientist, speaking a few years ago, said that as far as he could tell, First Failure Data Capture would remain an IBM-exclusive feature for the next 7 years: nothing like it had been added to HP or Sun motherboard designs, and it takes at least that long to incorporate this capability into an existing architecture. Said differently, if Sun, HP, or anyone else ever plans to offer something like First Failure Data Capture, it will take them 7–10 years before they are in a position to give IBM a run for its money.
If First Failure Data Capture is such an important feature, why have you probably never heard of it before? Any parent can guess why: the “bad” child gets most of the attention.