IBM BladeCenter PS703 Technical Overview And Introduction


Client control of the service environment extends to firmware maintenance on all of the POWER processor-based systems. This strategy contributes to higher systems availability with reduced maintenance costs.

This section provides an overview of the progressive steps of error detection, analysis, reporting, notifying, and repairing found in all POWER processor-based systems.

The term servicer, when used in the context of this discussion, denotes the person tasked with performing service-related actions on a system. For an item designated as a Customer Replaceable Unit (CRU), the servicer might be the client. In other cases, for Field Replaceable Unit (FRU) items, the servicer might be an IBM representative or an authorized warranty service provider.

Service can be divided into three main categories:

- Service Components: The basic service-related building blocks
- Service Functions: Service procedures or processes containing one or more service components
- Service Operating Environment: The specific system operating environment, which specifies how service functions are provided by the various service components

The basic component of service is a Serviceable Event.

Serviceable events are platform, regional, and local error occurrences that require a service action (repair). This action can include a call home to report the problem so that the repair can be assessed by a trained service representative. In all cases, the client is notified of the event. Event notification includes a clear indication of when servicer intervention is required to rectify the problem. The intervention might be a service action that the client can perform or it might require a service provider.

Serviceable events are classified as follows:

1. Recoverable: This is a correctable resource or function failure. The server remains available, but there might be some decrease in operational performance available for the client's workload (applications).
2. Unrecoverable: This is an uncorrectable resource or function failure. In this instance, there is potential degradation in availability and performance, or loss of function to the client's workload.
3. Predictable (using thresholds in support of Predictive Failure Analysis): This is a determination that continued recovery of a resource or function might lead to degradation of performance or failure of the client's workload. Although the server remains fully available, if the condition is not corrected, an unrecoverable error might occur.
4. Informational: This is notification that a resource or function:
   – Is out-of or returned-to specification and might require user intervention.
   – Requires user intervention to complete one or more system tasks.

Platform errors are faults that affect all partitions in various ways. They are detected in the blade by the Service Processor, the System Power Control Network, or the POWER Hypervisor. When a failure occurs in these components, the POWER Hypervisor notifies each partition's operating system to execute any required precautionary actions or recovery methods. The OS is required to report these kinds of errors as serviceable events to the Service Focal Point application because, by definition, they affect the partition.
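
As an illustration only, the classification above and the platform-error reporting path might be modeled as in the following minimal Python sketch. It is not an IBM API: the names Scope, Severity, ServiceableEvent, may_become_unrecoverable, and platform_error_reports are hypothetical, and the partition names are made up for the example. The sketch assumes that a platform error results in one report per partition, as the text describes.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Where the error occurred (hypothetical names for the three kinds in the text)."""
    PLATFORM = "platform"
    REGIONAL = "regional"
    LOCAL = "local"


class Severity(Enum):
    """The four serviceable event classes listed in the text."""
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"
    PREDICTABLE = "predictable"
    INFORMATIONAL = "informational"


@dataclass(frozen=True)
class ServiceableEvent:
    scope: Scope
    severity: Severity
    description: str


def may_become_unrecoverable(event: ServiceableEvent) -> bool:
    """A predictable event warns that an unrecoverable error might
    occur if the condition is not corrected."""
    return event.severity is Severity.PREDICTABLE


def platform_error_reports(event, partitions):
    """Return one (partition, event) report per partition.

    Models the statement that a platform error affects all partitions
    and that each partition's OS reports it as a serviceable event.
    """
    if event.scope is not Scope.PLATFORM:
        raise ValueError("only platform errors are reported by every partition")
    return [(name, event) for name in partitions]


if __name__ == "__main__":
    # Hypothetical event and partition names, for illustration only.
    event = ServiceableEvent(Scope.PLATFORM, Severity.PREDICTABLE,
                             "correctable-error threshold exceeded")
    for partition, reported in platform_error_reports(event, ["lpar1", "lpar2"]):
        print(partition, reported.severity.value, reported.description)
```

Keeping scope and severity as separate enumerations mirrors the text, which describes where an error occurs (platform, regional, local) independently of how it is classified.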