The operating system cannot program or access the temperature threshold using the SP.

EPOW events can, for example, trigger the following actions:

- Temperature monitoring, which increases the fan rotation speed when the ambient temperature is above a preset operating range.
- Temperature monitoring warns the system administrator of potential environment-related problems. It also performs an orderly system shutdown when the operating temperature exceeds a critical level.
- Voltage monitoring provides a warning and an orderly system shutdown when the voltage is out of the operational specification.

3.2.4 Self-healing

For a system to be self-healing, it must be able to recover from a failing component by first detecting and isolating the failed component, taking it offline, fixing or isolating it, and reintroducing the fixed or replacement component into service without any application disruption. Examples include:

- Bit steering to redundant memory in the event of a failed memory module to keep the server operational.
- Bit-scattering, thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill recovery).
- Single-bit error correction using ECC without reaching error thresholds for main, L2, and L3 cache memory.
- L3 cache line deletes extended from 2 to 10 for additional self-healing.
- ECC extended to inter-chip connections on fabric and processor bus.
- Memory scrubbing to help prevent soft-error memory faults.
- Dynamic processor deallocation: a deallocated processor can be replaced by an unused Capacity on Demand processor to keep the system operational.

Memory reliability, fault tolerance, and integrity

The p5-550 uses Error Checking and Correcting (ECC) circuitry for system memory to correct single-bit and to detect double-bit memory failures. Detection of double-bit memory failures helps maintain data integrity.
Furthermore, the memory chips are organized such that the failure of any specific memory module only affects a single bit within a four-bit ECC word (bit-scattering), thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill recovery). The memory DIMMs also use memory scrubbing and thresholding to determine when spare memory modules within each bank of memory should be used to replace modules that have exceeded their error-count threshold (dynamic bit-steering). Memory scrubbing is the process of reading the contents of the memory during idle time and checking and correcting any single-bit errors that have accumulated by passing the data through the ECC logic. This is a hardware function of the memory controller chip and does not affect normal system memory performance.

3.2.5 N+1 redundancy

The use of redundant parts allows the p5-550 to remain operational with full resources:

- Redundant spare memory bits in L1, L2, L3, and main memory
- Redundant fans
- Redundant power supplies (optional)
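
The single-bit correction, double-bit detection, and scrubbing behavior described under memory reliability can be illustrated with a small software model. The sketch below is an assumption-laden toy, not the p5-550 hardware implementation: it uses a textbook extended Hamming (8,4) code, and the names `encode`, `decode`, and `scrub` are hypothetical. Real memory controllers use wider ECC words and do this work in hardware.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) codeword."""
    bits = [0] * 8  # bit 0: overall parity; bits 1, 2, 4: Hamming parity
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) % 2  # makes the parity of all 8 bits even
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(bits) % 2
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flipped bits: treat as a single-bit error.
        # The syndrome is the flipped position (0 means the overall parity bit).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: double-bit error.
        return None, "uncorrectable"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


def scrub(memory: list) -> dict:
    """Read every word through the ECC logic and write back corrected data."""
    counts = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        nibble, status = decode(word)
        counts[status] += 1
        if status == "corrected":
            memory[addr] = encode(nibble)  # repair the stored word in place
    return counts
```

As a usage example, filling a list with `encode(n)` for each `n` in `range(16)`, flipping one bit in one word and two bits in another, and then calling `scrub` repairs the first word in place and reports the second as uncorrectable. In this model, the per-status counts returned by `scrub` stand in for the error thresholds that the text says trigger dynamic bit-steering.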