Article Open access Peer-reviewed

Experience with Restoration of Asia Pacific Network Failures from Taiwan Earthquake

2007; Institute of Electronics, Information and Communication Engineers; Volume: E90-B; Issue: 11; Language: English

10.1093/ietcom/e90-b.11.3095

ISSN

1745-1345

Authors

Yoshifumi Kitamura, Younho Lee, Rubens Zenko Sakiyama, Kei Okamura

Topic(s)

Mobile Agent-Based Network Management

Abstract

Introduction

As the Internet grows, networks become larger and more complex, and the number of components, such as routers, switches, and fiber cables, increases. In such complex systems, it is difficult to implement global network management across several Internet service providers (ISPs), each of which operates many network components in a large-scale topology. Fault management is a particularly important network management issue because the Internet has become essential to business and research. However, we are only beginning to learn how to deal with global failures in large networks.

Failures have been reported in the Sprint Internet Protocol (IP) backbone, which shows that failures occur in everyday operation (Iannaccone et al., 2002; Markopoulou et al., 2004). However, the network failures observed by Iannaccone et al. (2002) and Markopoulou et al. (2004) were short-lived and small in scale, and their impacts were analyzed only in the context of a single ISP. Most network backup and fault restoration methods have been studied and proposed for individual layers such as wavelength division multiplexing (WDM), multi-protocol label switching (MPLS), or IP (Fumagalli & Valcarenghi, 2000; Gerstel & Ramaswami, 2000; Ramamurthy et al., 2003; Sahasrabuddhe et al., 2004; Sharma & Hellstrand, 2003). Yet the proposed backup and restoration methods have not been fully implemented and deployed in real networks. Since real networks are more complicated than theoretical ones, the impacts of network failures on users and ISPs cannot be completely predicted and analyzed. Significant network failures due to natural disasters such as earthquakes, floods, or fires could have a particularly wide impact across several ISPs.

We discuss the critical network failures that occurred after the Taiwan earthquake in December 2006, which cut fibers and caused network failures. We also explain how restoration methods such as automatic border gateway protocol (BGP) (Lougheed & Rekhter, 1989) re-routing, BGP policy changes, and switch reconfiguration were carried out. We hope that the experience and knowledge we gained while recovering from this huge natural disaster, which affected the global Internet, can be shared and can contribute to future Internet network management research. To the best of our knowledge, this is the first detailed study of network restoration after global network failures due to a natural disaster.

Although many natural disasters have occurred in the 21st century, until recently there had been no simultaneous outage of the global Internet backbone. However, the earthquake that occurred near Taiwan in 2006 made several Asia Pacific Research and Education (R&E) networks unreachable. At 21:26 and 21:34 (UTC+9) on December 26, 2006, two large undersea earthquakes struck off the coast of Taiwan, measuring 7.1 and 6.9, respectively, on the moment magnitude scale (Hanks & Kanamori, 1979). These earthquakes caused significant damage to the undersea fiber cable systems in that area. Several ISPs were affected because each cable system is shared by multiple ISPs. The earthquakes had the effect of dividing the Asia Pacific R&E networks into an eastern and a western group. The Asia Pacific R&E networks in particular were seriously damaged, but they were fully restored after several restoration steps: automatic BGP re-routing, BGP policy changes, and switch port reconfigurations.

The first step in recovery was taken automatically by BGP routers, which detoured traffic along redundant routes. In BGP routing, there are usually multiple redundant AS paths. The redundant BGP routes served as backup paths but provided poor-quality connectivity, i.e., long round-trip times (RTTs). Because congestion was subsequently reported on narrow-bandwidth links, operators took manual control of traffic to improve communication quality. The second step was therefore a traffic engineering process intended to prevent narrow-bandwidth links from filling up with detoured traffic: the operators changed the BGP routing policy for the congested ASs. Even after this routing-level restoration, a few institutions were still not directly connected to the R&E network community because they had only a single link to the network. For these single-link networks, a commodity link was used temporarily for connectivity. However, the commodity link was not stable and did not have enough bandwidth to carry large traffic volumes or to provide next-generation Internet services. To restore the single-link networks, cable connection configurations at the switches were changed.

The fiber break caused by the Taiwan earthquake raised restoration issues related to BGP re-routing. In such an emergency, backup routes should be chosen based on available bandwidth and RTT. Since the fiber break required an urgent network recovery process, network operators configured re-routing based on their own experience with bandwidth and RTT. From this experience, we have learned that redundant physical backup links and routes are important for providing bandwidth and connectivity, and that the Quality-of-Service (QoS) after recovery is also important. From the viewpoint of restoration after network failures, there are still challenges that network management systems cannot overcome automatically. A systematic risk management plan that includes collaboration among operators of the next-generation Internet is needed.

The remainder of this chapter is organized as follows. Section 2 introduces the Asia Pacific R&E networks that were damaged by the earthquake or related events. Section 3 introduces R&E connectivity, especially in the Asia Pacific area, and the issues caused by such inter-connectivity of R&E networks. Section 4 is a detailed report of the network failures that were observed after the earthquake. Section 5 describes the process of restoring the disrupted communications in the area. Section 6 discusses what we have learned from observing the network failures and recovery processes. Finally, we conclude in Section 7.

Source: Earthquake Research and Analysis - Statistical Studies, Observations and Planning (www.intechopen.com), p. 362.

Reference(s)