IBM BladeCenter PS703 Technical Overview And Introduction


system memory to reload the cache line from main memory. Modified data would be handled through Special Uncorrectable Error handling.

L2 and L3 deleted cache lines are marked for persistent deconfiguration on subsequent system reboots until they can be replaced.

4.3.5 Special uncorrectable error handling

Although rare, an uncorrectable data error can occur in memory or a cache. IBM POWER processor-based systems attempt to limit, to the least possible disruption, the impact of an uncorrectable error using a well-defined strategy that first considers the data source. Sometimes, an uncorrectable error is temporary in nature and occurs in data that can be recovered from another repository. Consider the following examples:

- Data in the instruction L1 cache is never modified within the cache itself. Therefore, an uncorrectable error discovered in the cache is treated as an ordinary cache miss, and correct data is loaded from the L2 cache.
- The L2 and L3 cache of the POWER7 processor-based systems can hold an unmodified copy of data in a portion of main memory. In this case, an uncorrectable error would trigger a reload of a cache line from main memory.

In cases where the data cannot be recovered from another source, a technique called Special Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in memory or cache from immediately causing the system to terminate. Rather, the system tags the data and determines whether it will ever be used again. Note the following information:

- If the error is irrelevant, it does not force a check stop.
- If the data is used, termination can be limited to the program or kernel, or hypervisor owning the data.
Also possible is the freezing of the I/O adapters that are controlled by an I/O hub controller if data is to be transferred to an I/O device.

When an uncorrectable error is detected, the system modifies the associated ECC word, thereby signaling to the rest of the system that the standard ECC is no longer valid. The service processor is notified and takes appropriate actions. When AIX (V5.2 and later) or Linux is running and a process attempts to use the data, the operating system is informed of the error and might terminate, or might only terminate a specific process associated with the corrupt data. This depends on the operating system and firmware level and whether the data was associated with a kernel or non-kernel process.

Only in the case where the corrupt data is used by the POWER Hypervisor must the entire system be rebooted, thereby preserving overall system integrity.

Depending on system configuration and source of the data, errors encountered during I/O operations might not result in a machine check. Instead, the incorrect data is handled by the processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data, preventing data being written to the I/O device.

The PHB enters a freeze mode that halts normal operations. Depending on the model and type of I/O being used, the freeze might include the entire PHB chip, or a single bridge. This results in the loss of all I/O operations that use the frozen hardware until a power-on reset of the PHB is performed. The impact to partitions depends on how the I/O is configured for redundancy. In a server configured for fail-over availability, redundant adapters spanning multiple PHB chips can enable the system to recover transparently, without partition loss.
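
The decision flow described in this section can be summarized as a small model. The following Python sketch is illustrative only and is not IBM firmware or operating system code: the names DataSource, Consumer, Action, handle_uncorrectable_error, and io_survives_freeze are hypothetical and were chosen for this example. It assumes the worst-case outcome for kernel-owned data (the text notes that the actual result depends on the operating system and firmware level), and it treats a freeze as affecting a whole PHB rather than a single bridge.

```python
from enum import Enum, auto


class DataSource(Enum):
    """Where the uncorrectable error was found (hypothetical names)."""
    L1_INSTRUCTION_CACHE = auto()  # never modified within the cache itself
    L2_L3_UNMODIFIED = auto()      # clean copy of data held in main memory
    NO_OTHER_COPY = auto()         # data cannot be recovered elsewhere


class Consumer(Enum):
    """Who, if anyone, later uses the SUE-tagged data."""
    NEVER_USED = auto()
    USER_PROCESS = auto()
    KERNEL = auto()
    HYPERVISOR = auto()
    IO_TRANSFER = auto()


class Action(Enum):
    RELOAD_FROM_L2 = auto()
    RELOAD_FROM_MEMORY = auto()
    NO_CHECK_STOP = auto()
    TERMINATE_PROCESS = auto()
    TERMINATE_OS = auto()
    REBOOT_SYSTEM = auto()
    FREEZE_PHB = auto()


# Assumed worst-case outcome once tagged data is actually consumed.
# Real behavior depends on the operating system and firmware level.
_SUE_OUTCOME = {
    Consumer.NEVER_USED: Action.NO_CHECK_STOP,
    Consumer.USER_PROCESS: Action.TERMINATE_PROCESS,
    Consumer.KERNEL: Action.TERMINATE_OS,
    Consumer.HYPERVISOR: Action.REBOOT_SYSTEM,
    Consumer.IO_TRANSFER: Action.FREEZE_PHB,
}


def handle_uncorrectable_error(source, consumer):
    """Return the modeled action for an uncorrectable error."""
    # Recoverable cases: the data source is considered first.
    if source is DataSource.L1_INSTRUCTION_CACHE:
        return Action.RELOAD_FROM_L2       # treated as an ordinary cache miss
    if source is DataSource.L2_L3_UNMODIFIED:
        return Action.RELOAD_FROM_MEMORY   # reload the cache line
    # Otherwise the data is tagged (SUE) and the outcome depends on its use.
    return _SUE_OUTCOME[consumer]


def io_survives_freeze(adapter_phbs, frozen_phb):
    """True if at least one redundant adapter is on a PHB that is not frozen."""
    return any(phb != frozen_phb for phb in adapter_phbs)


if __name__ == "__main__":
    print(handle_uncorrectable_error(DataSource.NO_OTHER_COPY,
                                     Consumer.HYPERVISOR).name)
    print(io_survives_freeze(["phb0", "phb1"], "phb0"))
```

In this model, only hypervisor-owned data leads to a full system reboot, and an I/O path survives a freeze only when its redundant adapters span more than one PHB, which mirrors the fail-over configuration described above.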