Failure recovery for distributed processes in single system image clusters

Book chapter, peer reviewed

1998; Springer Science+Business Media; Language: English

10.1007/3-540-64359-1_728

ISSN

1611-3349

Authors

Jeffrey Zabarsky

Topic(s)

Interconnection Networks and Systems

Abstract

Single System Image (SSI) Distributed Operating Systems have been the subject of increasing interest in recent years. This interest has been fueled primarily by the trend towards hardware designs that address the scalability problems of traditional Symmetric Multiprocessor (SMP) architectures. These designs run the gamut from inexpensive compute nodes connected by high-speed interconnects to architectures in which some or all memory is shared between nodes. As machines scale to large numbers of nodes, it becomes increasingly intolerable to allow the failure of any single node to bring down an entire system. Handling failures can dramatically improve overall system reliability and availability. Among the various components of a distributed operating system, the distributed processing component presents significant failure recovery challenges. This is due to the large number of relationships processes can participate in and the potential for process state to be distributed over many nodes. This paper presents a failure recovery design for the distributed processing component of Tandem's NonStop Clusters for UnixWare SSI distributed computing technology. The failure handling issues, design objectives, detailed design, and implementation experience are presented. Finally, it is shown that it is possible to construct efficient and robust mechanisms to track relationships and maintain appropriate SSI semantics in the face of arbitrary failures in a clustered environment.
