IBM BladeCenter PS703 Technical Overview And Introduction

Chapter 4. Continuous availability and manageability

4.3.6 PCI extended error handling

IBM estimates that PCI adapters can account for a significant portion of the hardware-based errors on a large server. Although servers that rely on boot-time diagnostics can identify failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a more significant problem.

PCI adapters are generally complex designs involving extensive on-board instruction processing, often on embedded microcontrollers. They tend to use industry-standard-grade components with an emphasis on product cost relative to high reliability. In certain cases, they might be more likely to encounter internal microcode errors, or many of the hardware errors described for the rest of the server.

The traditional means of handling these problems is through adapter internal error reporting and recovery techniques, in combination with operating system device driver management and diagnostics. In certain cases, an error in the adapter might cause transmission of bad data on the PCI bus itself, resulting in a hardware-detected parity error and causing a global machine-check interrupt, eventually requiring a system reboot to continue.

PCI extended error handling (EEH) enabled adapters respond to a special data packet that is generated from the affected PCI slot hardware by calling system firmware (that examines the affected bus), allowing the device driver to reset it and continue without a system reboot. For Linux, EEH support extends to the majority of frequently used devices, although certain third-party PCI devices might not provide native EEH support.

To detect and correct PCIe bus errors, POWER7 processor-based systems use CRC detection and instruction retry correction.

4.4 Serviceability

IBM Power Systems design enables IBM to be responsive to the client’s needs.
The IBM Serviceability Team has enhanced the base service capabilities and continues to implement a strategy that incorporates best-of-breed service characteristics from diverse IBM Systems offerings.

Serviceability includes system installation, system upgrades and downgrades (MES), and system maintenance and repair. The goal of the IBM Serviceability Team is to design and provide the most efficient system service environment. Such an environment includes the following elements:

- Easy access to service components; design for Customer Set Up (CSU), Customer Installed Features (CIF), and Customer Replaceable Units (CRU)
- On-demand service education
- Error detection and fault isolation (ED/FI)
- First-failure data capture (FFDC)
- An automated guided repair strategy that uses common service interfaces for a converged service approach across multiple IBM server platforms

By delivering on these goals, IBM Power Systems servers enable faster and more accurate repair, and reduce the possibility of human error.
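
As an aside on the EEH recovery sequence described in section 4.3.6, the following minimal Python sketch models the flow in software: the slot reports an error, firmware examines the slot, and the device driver resets it and completes the request without a system reboot. Every name here (PciSlot, Firmware, DeviceDriver, slot_has_error, transfer) is hypothetical and chosen only for illustration; it is not an IBM or operating system interface, and real EEH handling is performed by slot hardware, firmware, and kernel device drivers.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SlotState(Enum):
    NORMAL = auto()
    ERROR = auto()  # assumption: one flag stands in for the slot's error condition


@dataclass
class PciSlot:
    """Hypothetical model of an EEH-enabled PCI slot."""
    state: SlotState = SlotState.NORMAL
    resets: int = 0

    def inject_error(self) -> None:
        # Simulates the slot hardware detecting an adapter error.
        self.state = SlotState.ERROR

    def reset(self) -> None:
        # In this sketch a reset always clears the error (a simplification).
        self.resets += 1
        self.state = SlotState.NORMAL


class Firmware:
    """Hypothetical stand-in for the system firmware that examines the slot."""

    def slot_has_error(self, slot: PciSlot) -> bool:
        return slot.state is SlotState.ERROR


@dataclass
class DeviceDriver:
    """Hypothetical driver that recovers the slot instead of forcing a reboot."""
    slot: PciSlot
    firmware: Firmware
    log: list = field(default_factory=list)

    def transfer(self, payload: bytes) -> bytes:
        # Ask firmware to examine the slot before completing the request.
        if self.firmware.slot_has_error(self.slot):
            self.log.append("error reported; resetting slot")
            self.slot.reset()
        self.log.append("transfer completed")
        return payload


if __name__ == "__main__":
    slot = PciSlot()
    driver = DeviceDriver(slot, Firmware())
    slot.inject_error()
    driver.transfer(b"data")
    print(driver.log)
```

The point of the sketch is the control flow only: the error is confined to one slot and handled by the driver, rather than escalating to a global machine check.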