Disclaimer: iSeries and AS400 are servers. IBM i is an operating system. I use these terms interchangeably to make it easier for folks to find this kind of information on the web.
A Single Point of Failure (SPOF) is any component of an IT application that, if it were to fail, would cause the application to be unavailable to users. In other words, the system would be DOWN. There are many components of an IT application and the goal of this article is to identify the SPOFs and provide alternatives for minimizing their impact on application availability. IT components not only include hardware and software, but also include the network and power source.
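As a rough illustration of the definition above, the sketch below lists components by the role they fill and flags any role served by only one component. The `Component` class, the `find_spofs` function, and the sample inventory are all hypothetical names and data made up for this example; a real inventory would come from your own server configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str  # hypothetical label, e.g. "power supply A"
    role: str  # the function the component provides, e.g. "power supply"


def find_spofs(components):
    """Return the roles that are served by exactly one component."""
    names_by_role = {}
    for component in components:
        names_by_role.setdefault(component.role, []).append(component.name)
    return sorted(role for role, names in names_by_role.items() if len(names) == 1)


# Made-up inventory for illustration only.
inventory = [
    Component("power supply A", "power supply"),
    Component("power supply B", "power supply"),
    Component("utility feed", "power source"),
    Component("network adapter 1", "network"),
    Component("disk adapter 1", "disk adapter"),
    Component("disk adapter 2", "disk adapter"),
]

print(find_spofs(inventory))  # ['network', 'power source']
```

In this made-up inventory the power supplies and disk adapters are paired, so only the single power source and the single network adapter are reported.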
Power
Dual (redundant) power supplies are standard features on IBM Power servers and also on expansion drawers for disk drives and adapters. Power supplies are not a SPOF, but they should be connected to different power sources to eliminate the power source as a SPOF. In smaller shops, one power supply is connected to a UPS and the other is connected directly to utility power. If either utility power or the UPS should fail, electricity will still be supplied to the server. In larger shops, each power supply is attached to a different UPS. For installations with the highest uptime requirements, the UPS will also have a backup generator. Electricity should never be a SPOF for IBM Power servers.
On IBM HMCs (Hardware Management Consoles), dual power supplies are a recommended option, not a standard feature. So, make sure that you ask for them!
Network
Users access your server through your network. If the network is down, or the server’s connection to the network fails, then your application is down. Ideally, your server should have more than one connection to the network. Standard 1 Gb network adapters for IBM Power servers are inexpensive and have 4 ports. Network adapter cards rarely fail, but when one does, users cannot access the server. Even if you only need 1 network connection, your server should be configured with 2 network adapters so that the adapter is not a SPOF. With Link Aggregation or a Virtual IP address, multiple adapters are in use at the same time, so the failure of one does not cut off access to the server.
Disk
IBM i Power servers offer two disk protection options that eliminate a single disk drive as a SPOF: mirroring and RAID. With mirroring, each disk drive is assigned a mirrored backup. Whenever a disk write occurs, the operating system writes the data to both drives of the mirrored pair, keeping the pair in sync. A RAID array maintains redundant data that allows the array to rebuild the data from a failed disk drive.
When a disk drive fails, the server stays fully operational, but the mirrored pair or the RAID array becomes a SPOF until the failed disk drive has been replaced. With RAID, there is noticeable system performance degradation while the array rebuilds the data from a failed disk drive. With mirroring, a failed disk drive does not affect system performance.
You can prevent a disk drive failure from turning the disk subsystem into a SPOF by using the Hot Spare option. One or more disk drives can be designated as Hot Spares that automatically replace a failed disk drive. Hot Spares can be designated in both mirroring and RAID environments. When a disk drive fails, IBM i will rebuild the data from the failed drive onto the Hot Spare and the system will return to normal operation. With a Hot Spare, the disk subsystem is a SPOF only during the data rebuild, whose duration depends on the size of the drive. Without a Hot Spare, the disk subsystem is a SPOF until the failed disk drive is replaced, which can take hours or days. During that window, a second disk drive failure in the same RAID array or mirrored pair will bring the system down. Considering the low cost of disk drives and the high cost of application downtime, Hot Spares are highly recommended.
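The difference in exposure can be put into simple arithmetic. The sketch below is only an illustration: the function name `exposure_hours` and the sample figures (a 4-hour rebuild, a 24-hour wait for a replacement drive) are assumptions, not measurements. It also assumes that, without a Hot Spare, the data still has to be rebuilt onto the replacement drive once it arrives.

```python
def exposure_hours(rebuild_hours, replacement_wait_hours, hot_spare):
    """Estimate how long the disk subsystem remains a SPOF after one drive fails.

    With a Hot Spare the rebuild starts right away, so the window is the rebuild time.
    Without one, the window is the wait for a replacement drive plus the rebuild
    onto that drive (an assumption of this sketch).
    """
    if rebuild_hours < 0 or replacement_wait_hours < 0:
        raise ValueError("durations must not be negative")
    if hot_spare:
        return rebuild_hours
    return replacement_wait_hours + rebuild_hours


# Made-up figures for illustration only.
with_spare = exposure_hours(rebuild_hours=4, replacement_wait_hours=24, hot_spare=True)
without_spare = exposure_hours(rebuild_hours=4, replacement_wait_hours=24, hot_spare=False)
print(with_spare, without_spare)  # 4 28
```

Plug in your own rebuild times and your service contract's replacement window to see how much exposure a Hot Spare removes.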
Disk drives are connected to the server via disk adapters. Current IBM i systems are configured with paired disk adapters. Should one adapter fail, the remaining adapter will automatically take over and the server will remain operational.
Tape
Few servers these days use tape drives in the operation of the application. Tape is commonly used just to back up servers. If the tape drive failed, you would not be able to run a backup, but the application would continue to run. If a tape drive is required for the operation of your application, for example to create a tape that sends data to another server, then a second tape drive and tape adapter are needed to eliminate tape as a SPOF.
The Motherboard
A server’s motherboard is a SPOF. There can only be one motherboard in a Scale-out IBM Power server, but with Enterprise class Power servers each node has a motherboard. The design of the motherboard incorporates many features to prevent server failure due to the failure of a critical component. If a processor core fails, it can be de-configured and replaced with a spare core. Memory failures can also be circumvented using the self-healing techniques built into every Power server. There is no call to action here. The motherboard is extremely reliable and self-healing, but for more information read the Reliability, Availability and Serviceability (RAS) chapter of the Technical Overview and Introduction IBM Redbook for your server model.
Single Points of Failure
The motherboard and the software components of your server (operating system, IBM licensed programs, third-party programs or application code) are SPOFs. There is little or nothing that you can do in the configuration of the server to prevent a system failure should one of these components fail.
However, there is something that can be done to keep your application running when a server fails, and that is a High Availability/Disaster Recovery (HA/DR) solution. With HA/DR, a second server, the backup server, is available to run the production workload when the production server fails. There are many different HA/DR options available that vary in cost, in the time to recover from an outage (RTO, the Recovery Time Objective), and in the amount of possible data loss (RPO, the Recovery Point Objective).
Clustering provides the highest level of availability. With clustering, multiple servers are set up to run the production workload, and when a server fails, the other servers in the cluster pick up its workload and there is no downtime. Clustering is the most expensive HA/DR solution and requires modifications to applications to implement Commitment Control.
Data Replication provides the next level of high availability, either through hardware or software data replication. With data replication, the data on the backup (target) server is kept in sync with the production (source) server by replicating updates as they occur. With data replication, there will be a short outage after the production server fails and before the backup server is made ready to run the production workload. Data replication solutions are very popular on IBM i.
With other HA/DR options, such as backup and restore via tape, the backup server will eventually come online to run the production workload, but there will be an outage (hours to days) and possibly some lost data.
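One way to weigh these options is to compare each against the RTO and RPO your business can tolerate and pick the least expensive one that qualifies. The sketch below is a minimal illustration of that idea. The `Option` class, the `cheapest_option` function, and every number in the table are placeholders chosen to match the relative ordering described above (clustering has no downtime and costs the most, data replication has a short outage, tape restore takes hours to days). The cost ranking below clustering is also an assumption, and none of these figures are vendor data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    rto_hours: float  # placeholder: time to recover from an outage
    rpo_hours: float  # placeholder: possible data loss, in hours of updates
    cost_rank: int    # 1 = least expensive (assumed ordering)


# Placeholder values for illustration only.
OPTIONS = [
    Option("clustering", 0.0, 0.0, 3),
    Option("data replication", 1.0, 0.1, 2),
    Option("backup/restore via tape", 48.0, 24.0, 1),
]


def cheapest_option(max_rto_hours, max_rpo_hours, options=OPTIONS):
    """Return the least expensive option that meets both limits, or None."""
    candidates = [
        option
        for option in options
        if option.rto_hours <= max_rto_hours and option.rpo_hours <= max_rpo_hours
    ]
    return min(candidates, key=lambda option: option.cost_rank, default=None)


choice = cheapest_option(max_rto_hours=4, max_rpo_hours=1)
print(choice.name)  # data replication
```

Replace the placeholder figures with quotes and recovery estimates for your own environment before drawing any conclusions from a comparison like this.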
Summary
You can increase application availability by eliminating SPOFs:
- Plug redundant power supplies into redundant power sources.
- Have a spare network adapter that is configured, connected to the network and ready to go.
- Use disk hot spares to keep the disk subsystem fully protected from disk failures.
- If your application cannot suffer an outage longer than it takes for a repair, then look at High Availability/Disaster Recovery solutions.
Every company has different needs and requirements. Give us a call or send me an email at blosey@source-data.com so we can help you determine how best to reduce your failures and be prepared with a solution that is best for you.
Need Help?
Call us at 714-593-0387 or email me at blosey@source-data.com.