Chapter 3. Capacity on Demand, RAS, and manageability 53

3.2.6 Fault masking

If corrections and retries succeed and do not exceed threshold limits, the system remains operational with full resources, and no intervention by the client or an IBM customer engineer is required. This technology is used for the following faults:

- CEC bus retry and recovery
- PCI-X bus recovery
- ECC Chipkill soft error

3.2.7 Resource deallocation

If recoverable errors exceed threshold limits, resources can be deallocated while the system remains operational, allowing maintenance to be deferred to a convenient time.

Dynamic or persistent deallocation

Dynamic deallocation of potentially failing components is non-disruptive, allowing the system to continue to run. Persistent deallocation occurs when a failed component is detected; the component is then deactivated at a subsequent reboot.

Dynamic deallocation functions include:

- Processor
- L3 cache line delete
- Partial L2 cache deallocation
- PCI-X bus and slots

For dynamic processor deallocation, the service processor performs a predictive failure analysis based on any recoverable processor errors that have been recorded. If these transient errors exceed a defined threshold, the event is logged and the processor is deallocated from the system while the operating system continues to run. This feature (named CPU Guard) enables maintenance to be deferred until a suitable time. Processor deallocation can only occur if there are sufficient functional processors (at least two).

To verify whether CPU Guard has been enabled, run the following command:

    lsattr -El sys0 | grep cpuguard

If enabled, the output will be similar to the following:

    cpuguard enable CPU Guard True

If the output shows CPU Guard as disabled, enter the following command to enable it:

    chdev -l sys0 -a cpuguard='enable'

Cache or cache-line deallocation is aimed at performing dynamic reconfiguration to bypass potentially failing components. This capability is provided for both L2 and L3 caches.
Dynamic run-time deconfiguration is provided if a threshold of L1 or L2 recovered errors is exceeded. In the case of an L3 cache run-time array single-bit solid error, the spare chip resources are used to perform a line delete on the failing line.

PCI hot-plug slot fault tracking helps prevent slot errors from causing a system machine check interrupt and subsequent reboot. This provides superior fault isolation, and the error affects only the single adapter. Run-time errors on the PCI bus caused by failing adapters will result in recovery action. If this is unsuccessful, the PCI device will be gracefully shut down.
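
The CPU Guard check and enable steps described in this section can be combined into a small script. The following is a minimal sketch, not an IBM-supplied tool: the function names cpuguard_state and ensure_cpuguard are hypothetical, and the parsing assumes that the lsattr output has the attribute name in the first column and its value in the second, as in the sample output shown earlier.

```shell
#!/bin/sh
# Hypothetical helpers; these names are illustrative and not part of AIX.

# Read `lsattr -El sys0` output on stdin and print the value of the
# cpuguard attribute (second column). Prints nothing if it is absent.
cpuguard_state() {
    awk '$1 == "cpuguard" { print $2; exit }'
}

# Query the current state and run chdev only when CPU Guard is not
# already enabled. Assumes an AIX system and sufficient authority.
ensure_cpuguard() {
    state=$(lsattr -El sys0 | cpuguard_state)
    if [ "$state" = "enable" ]; then
        echo "CPU Guard is already enabled"
    else
        echo "CPU Guard state is '${state:-unknown}', enabling"
        chdev -l sys0 -a cpuguard='enable'
    fi
}
```

After sourcing the file, calling ensure_cpuguard performs the check and, only if needed, the change. Keeping the parsing in a separate function makes it possible to try it against saved lsattr output without touching the system.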