IBM United States Hardware Announcement 110-009
IBM is a registered trademark of International Business Machines Corporation

• Disk drive fault tracking is designed to alert the system administrator of an impending disk drive failure before it impacts customer operation.

Mutual surveillance

The Service Processor monitors the operation of the firmware during the boot process, and also monitors the Hypervisor™ for termination. The Hypervisor monitors the Service Processor and will perform a reset/reload if it detects the loss of the Service Processor. If the reset/reload does not correct the problem with the Service Processor, the Hypervisor will notify the operating system and the operating system can take appropriate action, including calling for service.

Environmental monitoring functions

POWER7-based servers include a range of environmental monitoring functions:

• Temperature monitoring warns the system administrator of potential environmental-related problems by monitoring the air inlet temperature. When the inlet temperature rises above a warning threshold, the system initiates an orderly shutdown. When the temperature exceeds the critical level or if the temperature remains above the warning level for too long, the system will shut down immediately.
• Fan speed is controlled by monitoring actual temperatures on critical components and adjusting accordingly. If internal component temperatures reach critical levels, the system will shut down immediately, regardless of fan speed. When a redundant fan fails, the system calls out the failing fan and continues running. When a nonredundant fan fails, the system shuts down immediately.

Availability enhancement functions

The POWER7 family of systems continues to offer and introduce significant enhancements designed to increase system availability.

POWER7 processor functions

As in POWER6, the POWER7 processor has the ability to do processor instruction retry and alternate processor recovery for a number of core-related faults.
This significantly reduces exposure to both hard (logic) and soft (transient) errors in the processor core. Soft failures in the processor core are transient (intermittent) errors, often due to cosmic rays or other sources of radiation, and generally are not repeatable. When an error is encountered in the core, the POWER7 processor will first automatically retry the instruction. If the source of the error was truly transient, the instruction will succeed and the system will continue as before. On IBM systems prior to POWER6, this error would have caused a checkstop.

Hard failures are more difficult, being true logical errors that will be replicated each time the instruction is repeated. Retrying the instruction will not help in this situation because the instruction will continue to fail. As in POWER6, POWER7 processors have the ability to extract the failing instruction from the faulty core and retry it elsewhere in the system for a number of faults, after which the failing core is dynamically deconfigured and called out for replacement. The entire process is transparent to the partition owning the failing instruction. These systems are designed to avoid a full system outage.

POWER7 single processor checkstopping

As in POWER6, POWER7 provides single processor checkstopping. This significantly reduces the probability of any one processor affecting total system availability.

Partition availability priority

Also available is the ability to assign availability priorities to partitions. If an alternate processor recovery event requires spare processor resources in order to protect a workload, when no other means of obtaining the spare resources is available, the system will determine which partition has the lowest priority and attempt to claim the needed resource. On a properly configured POWER7 processor-