Chapter 3. Capacity on Demand, RAS, and manageability 51

Figure 3-1 Schematic of Fault Isolation Register implementation

The FIRs are important because they enable an error to be uniquely identified, thus enabling the appropriate action to be taken. Appropriate actions might include such things as a bus retry, ECC correction, or system firmware recovery routines. Recovery routines can include dynamic deallocation of potentially failing components.

Errors are logged into the system non-volatile random access memory (NVRAM) and the SP event history log, along with a notification of the event to AIX for capture in the operating system error log. Diagnostic Error Log Analysis (diagela) routines analyze the error log entries and invoke a suitable action such as issuing a warning message. If the error can be recovered, or after suitable maintenance, the service processor resets the FIRs so that they can accurately record any future errors.

The ability to correctly diagnose any pending or firm errors is a key requirement before any dynamic or persistent component deallocation or any other reconfiguration can take place. For more information, see “Dynamic or persistent deallocation” on page 53.

3.2.3 Permanent monitoring

The SP included in the p5-550 provides a means to monitor the system even when the main processor is inoperable. See the following subsections for a more detailed description of monitoring functions in p5-550.

Mutual surveillance

The SP can monitor the operation of the firmware during the boot process, and it can monitor the operating system for loss of control. This allows the service processor to take appropriate action, including calling for service, when it detects that the firmware or the operating system has lost control.
Mutual surveillance also allows the operating system to monitor for service processor activity and to request a service processor repair action if necessary.

Environmental monitoring

Environmental monitoring related to power, fans, and temperature is done by the System Power Control Network (SPCN). Environmental critical and non-critical conditions generate Early Power-Off Warning (EPOW) events. Critical events (for example, Class 5 AC power loss) trigger appropriate signals from hardware to impacted components so as to prevent any data loss without the operating system or firmware involvement. Non-critical environmental events are logged and reported through Event Scan.

[Figure 3-1 labels: CPU; L1 Cache; L2/L3 Cache; Memory; Fault Isolation Register (FIR) (unique fingerprint of each error captured); Service Processor; Non-volatile RAM; Error Checkers; Log Error; Disk]
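
The error flow labeled in Figure 3-1 (error checkers latch a unique bit in the FIR, the service processor logs the error to non-volatile RAM, and the FIRs are reset after recovery) can be illustrated with a minimal Python sketch. The bit layout, the action table, and all names (Fir, ACTIONS, ServiceProcessor) are assumptions made for illustration only; they do not describe the actual p5-550 hardware or firmware.

```python
from dataclasses import dataclass, field
from enum import IntFlag


class Fir(IntFlag):
    """Hypothetical FIR bit layout: one unique bit per error checker."""
    L1_CACHE_PARITY = 0x1
    L2_L3_CACHE_ECC = 0x2
    MEMORY_ECC = 0x4
    BUS_TIMEOUT = 0x8


# Hypothetical mapping from a captured error to a recovery action.
ACTIONS = {
    Fir.L1_CACHE_PARITY: "dynamic deallocation",
    Fir.L2_L3_CACHE_ECC: "ECC correction",
    Fir.MEMORY_ECC: "ECC correction",
    Fir.BUS_TIMEOUT: "bus retry",
}


@dataclass
class ServiceProcessor:
    fir: Fir = Fir(0)                              # no error bits latched
    nvram_log: list = field(default_factory=list)  # stands in for NVRAM

    def checker_fired(self, bit: Fir) -> None:
        """An error checker latches its own bit in the FIR."""
        self.fir |= bit

    def handle_errors(self) -> list:
        """Decode each latched bit, log it, then reset the FIR."""
        actions = []
        for bit in ACTIONS:
            if self.fir & bit:
                self.nvram_log.append(bit.name)
                actions.append(ACTIONS[bit])
        # Reset so that future errors are recorded accurately.
        self.fir = Fir(0)
        return actions
```

Because each checker owns a distinct bit, two simultaneous errors remain distinguishable in this sketch: calling checker_fired for MEMORY_ECC and BUS_TIMEOUT and then handle_errors yields one action per error and leaves the register cleared.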