Failure recovery for distributed processes in single system image clusters
1998; Springer Science+Business Media; Language: English
DOI: 10.1007/3-540-64359-1_728
ISSN: 1611-3349
Topic(s): Interconnection Networks and Systems
Abstract: Single System Image (SSI) distributed operating systems have been the subject of increasing interest in recent years. This interest has been fueled primarily by the trend towards hardware designs that address the scalability problems of traditional Symmetric Multiprocessor (SMP) architectures. These designs run the gamut from inexpensive compute nodes connected by high-speed interconnects to architectures in which some or all memory is shared between nodes. As machines scale to large numbers of nodes, it becomes increasingly intolerable for the failure of any single node to bring down the entire system. Handling such failures can dramatically improve overall system reliability and availability. Among the components of a distributed operating system, the distributed processing component poses significant failure recovery challenges, owing to the large number of relationships in which processes can participate and the potential for process state to be distributed over many nodes. This paper presents a failure recovery design for the distributed processing component of Tandem's NonStop Clusters for Unixware SSI distributed computing technology. It covers the failure handling issues, design objectives, detailed design, and implementation experience, and shows that it is possible to construct efficient and robust mechanisms to track relationships and maintain appropriate SSI semantics in the face of arbitrary failures in a clustered environment.