Chapter 4. Continuous availability and manageability

Platform errors are faults related to:
• The sysplanar: that part of the server composed of the central processor units, memory, storage controls, and the I/O hubs
• The power and cooling subsystems
• The firmware used to initialize the system and diagnose errors

Regional errors are faults that affect some, but not all partitions. They are detected by the POWER Hypervisor or the Service Processor.

Local errors are faults detected in a partition (by the partition firmware or the operating system) for resources owned only by that partition. The POWER Hypervisor and Service Processor are not aware of these errors. Local errors might include “secondary effects” that result from platform errors preventing partitions from accessing partition-owned resources. Examples include PCI adapters or devices assigned to a single partition. If a failure occurs to one of these resources, only a single operating system partition need be informed.

This section provides an overview of the progressive steps of error detection, analysis, reporting, notifying, and repairing that are found in all POWER processor-based systems.

4.4.1 Detecting

The first and most crucial component of a solid serviceability strategy is the ability to detect errors accurately and effectively when they occur. Although not all errors are a guaranteed threat to system availability, those that go undetected can cause problems because the system does not have the opportunity to evaluate and act if necessary. POWER processor-based systems employ IBM System z® server-inspired error detection mechanisms that extend from processor cores and memory to power supplies and hard drives.

Service processor

The service processor is a separate microprocessor from the main instruction processing complex.
The service processor provides the capabilities for the following elements:
• POWER Hypervisor (system firmware), IVM, Service and Support Module (SSM) under the SDMC, and BladeCenter Advanced Management Module (AMM) coordination
• Remote power control options
• Reset and boot features
• Environmental monitoring
  The service processor monitors the server’s built-in temperature sensors and sends this information to the BladeCenter AMM. The AMM can send instructions to the BladeCenter fans to increase rotational speed when the ambient temperature is beyond the normal operating range. Using an architected operating system interface, the service processor notifies the operating system of potential environmental problems so that the system administrator can take appropriate corrective actions before a critical failure threshold is reached.
  The service processor can also post a warning and initiate an orderly system shutdown in the following circumstances:
  – The operating temperature exceeds the critical level (for example, failure of air conditioning or air circulation around the system)
  – The system fan speed is out of operational specification (for example, because of multiple fan failures)