Chapter 19. High Availability under Linux

Table of Contents

19.1. Important Terms
19.2. A Sample Minimum Scenario
19.3. Components of a High Availability Solution
19.4. The Software Side of High Availability
19.5. Clustering
19.6. For More Information

Abstract

This chapter contains a short overview of the key concepts and tools from the area of high availability under Linux. It also offers suggested further reading for all the topics mentioned.

High availability describes systems that can mask certain malfunctions, in particular the failure of individual computers, so that the service becomes available to the user again after only a short downtime. Hardware and software are carefully coordinated and laid out redundantly, enabling an automatic switch to the remaining components in the event of a malfunction. High availability differs from fault tolerance in that the service is briefly unavailable during the switchover phase, which may be noticeable as delays or short losses of connection.

A system is generally considered highly available when the overall availability of the service lies between 99.999 percent and 99.99999 percent. This corresponds to a downtime of between five minutes and three seconds over an entire year. The decisive factor is not only the software and hardware, but above all well-conceived system administration with well-documented and comprehensible processes for minimizing faults. In every case, risks must be weighed against costs; depending on the application scenario, different requirements and solutions may be appropriate. Your Novell partner will be happy to advise you.
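
The downtime figures quoted above follow from simple arithmetic: the yearly downtime is the unavailable fraction of a year. The following short Python sketch (purely illustrative, not part of any HA tool) reproduces them:

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 31,557,600 seconds

def downtime_per_year(availability_percent):
    """Yearly downtime in seconds implied by an availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR

for availability in (99.999, 99.99999):
    print("%.5f%% -> %.1f seconds per year"
          % (availability, downtime_per_year(availability)))

# 99.999 percent yields about 316 seconds (roughly five minutes);
# 99.99999 percent yields about 3.2 seconds.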

19.1. Important Terms

Here are a few important terms related to high availability:

SPOF

Single Point of Failure: Component of a system whose failure impairs the functioning of the whole system.

Failover

Another similar system component automatically takes over the function of a failed component.

Cold Standby

The replacement hardware is kept ready, but is not running. Failover must be performed manually, so the failure will be clearly noticeable.

Warm Standby

The backup system runs in the background so that the takeover can happen automatically; the data on both systems is synchronized automatically. For the user, a failover looks like a very fast automatic restart of the service. However, the transaction in progress may be aborted if the data could not be synchronized before the failure.
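
As an illustration of such an automatic takeover, the following Python sketch monitors a primary server and promotes the standby after several failed health checks. The URL and promote_standby() are hypothetical placeholders; a real cluster manager performs this job far more robustly:

import time
import urllib.request

PRIMARY_HEALTH_URL = "http://primary.example.com/health"  # hypothetical endpoint

def primary_is_healthy():
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=2.0) as response:
            return response.status == 200
    except OSError:
        return False

def promote_standby():
    # Hypothetical hook: take over the service address and start serving.
    print("Primary unreachable: promoting the standby system")

failures = 0
while True:
    failures = 0 if primary_is_healthy() else failures + 1
    if failures >= 3:  # require several misses in a row to avoid false alarms
        promote_standby()
        break
    time.sleep(5)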

Hot Standby

Both systems permanently run in parallel, and the data on both is one hundred percent synchronized, so users will not be aware of any failures. This level cannot usually be reached without a corresponding modification to the client: to keep both systems completely synchronous, the connections to the client must be mirrored one hundred percent as well. This normally requires clients that hold connections to two or more servers at the same time and communicate with all of them. A normal web browser cannot do this.
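
To make the requirement on the client concrete, the sketch below issues the same request to two mirrored servers at once and uses whichever answer arrives first, so the loss of one node never surfaces. The hostnames are placeholders; this is a conceptual illustration, not a production pattern:

import concurrent.futures
import urllib.request

SERVERS = ["http://node-a.example.com/", "http://node-b.example.com/"]  # placeholders

def fetch(url):
    with urllib.request.urlopen(url, timeout=2.0) as response:
        return response.read()

# Send the request to both nodes in parallel; the first successful
# reply is used, so the failure of a single node goes unnoticed.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
    futures = [pool.submit(fetch, url) for url in SERVERS]
    for future in concurrent.futures.as_completed(futures):
        try:
            body = future.result()
            break  # one node answered; ignore the other
        except OSError:
            continue  # this node failed; wait for its twin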

Load Balancing

The distribution of load within a cluster of computers. Load balancing is used in a Linux Virtual Server (LVS) scenario, for example (see Section 19.5.2, “Linux Virtual Server”).
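
The simplest distribution policy is round robin, as the following Python sketch shows (load balancers such as LVS implement their schedulers in the kernel; the addresses here are placeholders):

import itertools

REAL_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # placeholder back ends

next_server = itertools.cycle(REAL_SERVERS)

def assign():
    """Pick the back-end server for the next incoming connection."""
    return next(next_server)

for connection in range(6):
    print("connection %d -> %s" % (connection, assign()))
# The connections are spread evenly: .11, .12, .13, .11, .12, .13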

STONITH

Shoot The Other Node In The Head: Special hardware and software that ensures that a faulty node cannot gain write access to shared media within a cluster, which would threaten data consistency in the entire cluster. In the simplest case, this means disconnecting the faulty system from the power supply.
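
Conceptually, a STONITH mechanism boils down to the following sketch: once a node stops answering, it is cut off before it can touch shared media again. Here power_off() is a hypothetical stand-in for real fencing hardware such as a remotely switchable power distribution unit, and the liveness probe is deliberately crude:

import socket

def node_responds(node, port=22, timeout=2.0):
    """Crude liveness probe: can the node still be reached at all?"""
    try:
        with socket.create_connection((node, port), timeout=timeout):
            return True
    except OSError:
        return False

def power_off(node):
    # Hypothetical stand-in for a remote power switch.
    print("Fencing %s: disconnecting it from the power supply" % node)

def fence_if_dead(node):
    if not node_responds(node):
        # The node may still be half-alive. Kill it outright so it can
        # no longer write to shared media and corrupt cluster data.
        power_off(node)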