The authors have expanded and updated their classic 1992 work to include:
• New regulatory developments
• ATM principles and information formats
• Broadband signaling, connection control, and network-level management
• Traffic performance and management
• Resilience in a multilayer transport network
• Practical information on design and implementation
• Network planning and tools
• An expanded look at network performance and synchronization
An invaluable guide if you are involved in transport network design, planning, implementation, or management, this book provides a wealth of practical experience and insights drawn from the knowledge of pioneers in the field.
Read an Excerpt
Chapter 10: Network Resilience
Modern industrialized societies are becoming increasingly dependent on telecommunications. Major service outages for whatever cause, whether it be human error or component failure, can cause expensive disruption to large sections of the community. Maintaining service availability even under failure conditions has therefore become a prime objective. Survivability of the transport network in the presence of cable breaks and equipment failure is now a routine requirement while graceful recovery from major disaster is a central operational and planning preoccupation for most network operators. Synchronous digital hierarchy (SDH) provides a number of standardized mechanisms to provide protection against most low-level failures while the managed flexibility inherent in both asynchronous transfer mode (ATM) and SDH provides the basic features to support effective wide area service restoration.
This chapter introduces the concept of availability, explores the factors that determine achievable availability, and reviews techniques and procedures that may be used to ensure complex layer networks can be operated with the expected availability.
10.1 SERVICE AVAILABILITY IN MULTILAYER NETWORKS
We can recognize two distinct responses to network failure. The first is service-oriented and is directed towards restoring interrupted services on failure by attempting to reestablish the failed service using spare resources, or resources released by dropping lower priority traffic. This has led to the development of processes that respond to loss of service by essentially renegotiating with the network to locate suitable alternative resources with which to restore an acceptable level of service. The term service "restoration" is often used to describe this class of activities.
The other is resource-oriented and is directed towards improving the basic availability of the network components so that failure does not result in loss of service. Naturally, design and process improvements play a part in improving basic reliability, but the ultrahigh levels of component availability currently demanded have led to the provision of low-level autonomous mechanisms that exploit design redundancy. These mechanisms ensure that failures are detected and repaired before a loss of service can be declared. They have been termed "protection" mechanisms.
We don't intend to make too much of the terminological distinction between restoration and protection because, as we shall see, it is not always easy to make such a distinction in practical systems. Nevertheless, it is interesting to note the difference between the top-down and bottom-up approaches to resilience that have developed in each of the network layers. The selection of appropriate strategies in each layer of a complex multilayer network is a major network design issue.
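The effect of exploiting design redundancy, as protection mechanisms do, can be illustrated with a rough calculation. The sketch below is not taken from the book: it assumes the textbook steady-state definition of availability (mean time between failures divided by the sum of that and mean time to repair), assumes the working and standby facilities fail independently, and ignores switching time and failure of the switching mechanism itself. The function names and the figures used are hypothetical.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one repairable component.

    Assumption: availability = MTBF / (MTBF + MTTR).
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def protected_availability(a_working: float, a_standby: float) -> float:
    """Availability of a duplicated (working + standby) pair.

    Assumption: failures are independent, so service is lost only
    when both facilities are down at the same time.
    """
    return 1.0 - (1.0 - a_working) * (1.0 - a_standby)


def downtime_hours_per_year(a: float) -> float:
    """Expected unavailable hours in one year for availability a."""
    return (1.0 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative figures only: one failure per 999 hours, 1 hour to repair.
    single = availability(mtbf_hours=999.0, mttr_hours=1.0)
    pair = protected_availability(single, single)
    print(f"single facility: {downtime_hours_per_year(single):.2f} h/year down")
    print(f"protected pair:  {downtime_hours_per_year(pair):.5f} h/year down")
```

Under these idealized assumptions, duplicating a facility that is 99.9% available cuts expected downtime from several hours a year to well under a minute. Real systems fall short of this when the two facilities share a common failure cause, such as a single duct carrying both cables.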
10.1.1 Service Impact of Network Failure
The service impact of the failure of a network component naturally depends on the scope of the failure but also on the tolerance of the supported services. Voice telephony, for example, can still operate satisfactorily in the presence of quite large transport failure because telephony networks are usually planned to provide physically diverse transport alternatives between public switched telephone network (PSTN) nodes. Existing calls using the failed facility are dropped and the correspondents must themselves reestablish communication on the remaining facilities. Providing this does not happen too often, the resulting inconvenience is usually considered minor and acceptable. If there is inadequate provision in terms of spare capacity such that congestion results from the call reattempts, the service impact may quickly become unacceptable. This is a very basic form of service restoration where the subscribers themselves are the agents initiating the activity, applying their own individual priority decisions about whether to redial or not and how long to persist, in the event of repeated failures, in attempting to reestablish connection. Alternate routing algorithms in the circuit layer may also be used to provide alternative routings to replace those lost with the failure. If the alternative resource is scarce, alternative routes are provided on a first come, first served basis. This is a powerful model that is also the basis of a class of path restoration systems finding application as an autonomous process in the transport network.
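The first come, first served allocation of scarce alternative routes described above can be sketched in a few lines. This is a simplified illustration rather than an algorithm given in the book: the `SpareRoute` type, the route names, and the fixed search order over routes are all hypothetical, and each call is assumed to need exactly one circuit.

```python
from dataclasses import dataclass


@dataclass
class SpareRoute:
    """An alternative route with a limited number of free circuits."""
    name: str
    free_circuits: int


def restore_first_come_first_served(failed_calls, alternate_routes):
    """Assign each failed call, in arrival order, to the first route
    that still has a free circuit.

    Returns (restored, blocked): a dict mapping call -> route name,
    and a list of calls for which no spare circuit was left.
    """
    restored = {}
    blocked = []
    for call in failed_calls:
        route = next((r for r in alternate_routes if r.free_circuits > 0), None)
        if route is None:
            blocked.append(call)  # spare capacity exhausted
        else:
            route.free_circuits -= 1
            restored[call] = route.name
    return restored, blocked


if __name__ == "__main__":
    routes = [SpareRoute("via-B", 2), SpareRoute("via-C", 1)]
    restored, blocked = restore_first_come_first_served(
        ["call-1", "call-2", "call-3", "call-4"], routes
    )
    print(restored)
    print(blocked)
```

With three spare circuits and four reattempts, the last arrival is blocked. This is the congestion case the text warns about when spare capacity is inadequate: no priority is applied, so which calls are restored depends only on the order in which they arrive.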
Data communication is more diverse in the rates and protocols used and also in the user services supported. Some low-rate transaction systems such as those used for electronic funds transfer at point of sale (EFTPOS) are in a similar class to voice telephony in levels of inconvenience and scope, but interruption of remote security alarm systems due to communications failure, even if the scope is not large, could be much more serious.
Connectionless data networks such as those that have developed to support the Internet employ automatic and adaptive procedures that make them resilient in a similar way to the telephone network, even when using very unreliable network resources, provided they have access to alternative transport routing possibilities that are unlikely to fail simultaneously.
Users of dedicated leased lines are generally the most vulnerable to failure. It is common in today's network for such leased line users to also lease spare capacity so they can implement their own recovery procedures. Guaranteeing diversity to such users is not a trivial problem for operators who for other reasons need to decouple physical network operations from services. This is made even more difficult in a multioperator situation where a user service may in fact be subcontracted in parts to several independent operators who have neither the capability nor the motivation to safeguard mutual diversity. Total loss of such service for rather long and uncertain periods can be catastrophic to many business users, whose businesses may have become dependent on them.
Where protection or restoration mechanisms are applied, the response time can be critical to the perceived severity of the impact from a service perspective. Again, telephony is not seriously impacted by interruptions short enough that no calls are lost. If, however, the interruption has wide scope and is long enough to drop a large number of calls in progress, the subsequent call reestablishment activity in the PSTN can cause transient signaling overload, with an impact comparable to that with no protection at all. Longer response times are, in general, tolerable for less frequent occurrences or for failures of smaller scope if this can be traded against other advantages such as cost efficiency.
Quality expectations are undeniably higher today, and the trend is towards high availability across a broad range of services. In parallel, the evolving market environment stimulated by deregulation and competition is demonstrating a considerable interest in a broader range of service availabilities. So, we can expect that ultrahigh availability will be obtainable at a price, but there will also be a place for "bargain price" services with "reasonable" but lower availability. Exactly where the balance will be struck in terms of absolute availability targets and range of tariff-related tradeoffs in differentiated services is, as usual, a complex interaction of technology, economics, and market dynamics, and only time will tell. . . .