3.4. Doing switchover and failover with Slony-I

3.4.1. Foreword

Slony-I is an asynchronous replication system. Because of that, it is almost certain that at the moment the current origin of a set fails, the final transactions committed at the origin will not yet have propagated to the subscribers. Systems are particularly likely to fail under heavy load; that is one of the corollaries of Murphy's Law. Therefore the principal goal is to prevent the main server from failing. The best way to do that is frequent maintenance.

Opening the case of a running server is not exactly what we should consider a "professional" way to do system maintenance. And interestingly, those users who found it valuable to use replication for backup and failover purposes are the very ones that have the lowest tolerance for terms like "system downtime." To help support these requirements, Slony-I not only offers failover capabilities, but also the notion of controlled origin transfer.

It is assumed in this document that the reader is familiar with the slonik utility and knows at least how to set up a simple two-node replication system with Slony-I.

3.4.2. Controlled Switchover

We assume a current "origin", node1, with one "subscriber", node2 (that is, a slave). A web application on a third server is accessing the database on node1. Both databases are up and running, and replication is more or less in sync. We perform the controlled switchover using SLONIK MOVE SET.
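The switchover can be sketched as a short slonik script. This is a minimal sketch rather than a definitive procedure: the cluster name, the conninfo strings, and the set id 1 are assumptions chosen for illustration. Because the application has to reconnect to the new origin, it is usually sensible to stop or pause it before running the script.

```slonik
# Hypothetical preamble: adjust the cluster name and conninfo strings to your setup.
cluster name = mycluster;
node 1 admin conninfo = 'dbname=mydb host=node1 user=slony';
node 2 admin conninfo = 'dbname=mydb host=node2 user=slony';

# Block updates to set 1 on the current origin (node 1).
lock set (id = 1, origin = 1);

# Transfer the origin role for set 1 from node 1 to node 2.
move set (id = 1, old origin = 1, new origin = 2);

# Wait until node 2 has confirmed the events generated on node 1.
wait for event (origin = 1, confirmed = 2, wait on = 1);
```

Once the script completes, node2 is the origin of set 1 and node1 continues as a subscriber to it, so the application can be pointed at node2.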

You may now simply shut down the server hosting node1 and do whatever is required to maintain the server. When the slon process for node1 is restarted later, it will start replicating again and soon catch up. At this point the procedure to switch origins is executed again to restore the original configuration.

This is the preferred way to handle things; it runs quickly, under control of the administrators, and there is no need for there to be any loss of data.

After performing the configuration change, you should run the scripts described in Section 5.1.1 to validate that the cluster state remains in good order after this change.

3.4.3. Failover

If some more serious problem occurs on the "origin" server, it may be necessary to use SLONIK FAILOVER to fail over to a backup server. This is a highly undesirable circumstance, as transactions "committed" on the origin, but not applied to the subscribers, will be lost. You may have reported these transactions as "successful" to outside users. As a result, failover should be considered a last resort. If the "injured" origin server can be brought up to the point where it can limp along long enough to do a controlled switchover, that is greatly preferable.

Slony-I does not provide any automatic detection for failed systems. Abandoning committed transactions is a business decision that cannot be made by a database system. If someone wants to put the commands below into a script executed automatically from the network monitoring system, well ... it's your data, and it's your failover policy.
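For the two-node example used above, a failover could be sketched in slonik as follows. This is a sketch under stated assumptions: the cluster name and conninfo strings are hypothetical, and whether an admin conninfo entry for the failed node is needed (or tolerated while that node is unreachable) may depend on your Slony-I version.

```slonik
# Hypothetical preamble: adjust the cluster name and conninfo strings to your setup.
cluster name = mycluster;
node 1 admin conninfo = 'dbname=mydb host=node1 user=slony';
node 2 admin conninfo = 'dbname=mydb host=node2 user=slony';

# Node 2 takes over as origin of every set that currently originates on node 1.
failover (id = 1, backup node = 2);

# Remove the failed node from the cluster configuration,
# submitting the event on the surviving node 2.
drop node (id = 1, event node = 2);
```

Any transactions committed on node 1 that had not yet reached node 2 are lost at this point, which is why the decision to run such a script remains with the administrator.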

3.4.4. Failover Targets

An origin node can only be failed over to other nodes in the cluster that are valid failover targets. The failover targets for a node must meet the following conditions

The view sl_failover_targets displays the valid failover targets for each origin node. Clusters that have more than two nodes and would like to have the option of using failover need to be set up in such a way that valid failover targets exist for the various failure scenarios that they wish to support.

3.4.5. Multiple Node Failures

If multiple nodes fail at the same time, perhaps because an entire data center has failed, then all failed nodes should be passed to the failover command. Consider a cluster where node 1 is the origin of a set and provides a subscription to node 2 and node 3, and node 2 in turn provides a subscription to node 4. What should happen if both nodes 1 and 2 fail? Slony-I can be told about the failed nodes with the following command:

FAILOVER (node=(id=1, backup node=3), node=(id=2, backup node=3));

This command requires that paths exist between nodes 3 and 4. It will then redirect node 4 to receive the subscription from node 3.
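The paths between nodes 3 and 4 have to be defined before the failure occurs. A minimal sketch of doing so with STORE PATH is shown here; the cluster name, host names, database name, and user are assumptions for illustration.

```slonik
# Hypothetical preamble: adjust the cluster name and conninfo strings to your setup.
cluster name = mycluster;
node 3 admin conninfo = 'dbname=mydb host=node3 user=slony';
node 4 admin conninfo = 'dbname=mydb host=node4 user=slony';

# How node 4 (client) connects to node 3 (server).
store path (server = 3, client = 4, conninfo = 'dbname=mydb host=node3 user=slony');

# How node 3 (client) connects to node 4 (server).
store path (server = 4, client = 3, conninfo = 'dbname=mydb host=node4 user=slony');
```

With both paths in place, node 4 can be redirected to node 3 even though it never subscribed to it directly.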

3.4.6. Automating FAIL OVER

If you do choose to automate FAIL OVER, it is important to do so carefully. You need to have good assurance that the failed node is well and truly failed, and you need to be able to ensure that the failed node will not accidentally return to service, which would leave two nodes able to respond in a "master" role.

Note: The problem here requiring that you "shoot the failed node in the head" is not fundamentally about replication or Slony-I; Slony-I handles this all reasonably gracefully, as once the node is marked as failed, the other nodes will "shun" it, effectively ignoring it. The problem is instead with your application. Supposing the failed node can come back up sufficiently that it can respond to application requests, that is likely to be a problem, and one that hasn't anything to do with Slony-I. The trouble is if there are two databases that can respond as if they are "master" systems.

When failover occurs, there therefore needs to be a mechanism to forcibly knock the failed node off the network in order to prevent applications from getting confused. This could take place via having an SNMP interface that does some combination of the following:

3.4.7. After Failover, Reconfiguring Former Origin

What happens to the failed node will depend somewhat on the nature of the catastrophe that led to needing to fail over to another node. If the node had to be abandoned because of physical destruction of its disk storage, there will likely not be anything of interest left. On the other hand, a node might be abandoned due to the failure of a network connection, in which case the former "provider" can appear to be functioning perfectly well. Nonetheless, once communications are restored, the fact of the FAIL OVER makes it mandatory that the failed node be abandoned.

After the above failover, the data stored on node 1 will rapidly become increasingly out of sync with the rest of the nodes, and must be treated as corrupt. Therefore, the only way to get node 1 back and transfer the origin role back to it is to rebuild it from scratch as a subscriber, let it catch up, and then follow the switchover procedure.

A good reason not to do this automatically is the fact that important updates (from a business perspective) may have been committed on the failing system. You probably want to analyze the last few transactions that made it into the failed node to see if some of them need to be reapplied on the "live" cluster. For instance, if someone was entering bank deposits affecting customer accounts at the time of failure, you wouldn't want to lose that information.

Warning

It has been observed that there can be some very confusing results if a node is "failed" due to a persistent network outage as opposed to failure of data storage. In such a scenario, the "failed" node has a database in perfectly fine form; it is just that since it was cut off, it "screams in silence."

If the network connection is repaired, that node could reappear, and as far as its configuration is concerned, all is well, and it should communicate with the rest of its Slony-I cluster.

In fact, the only confusion taking place is on that node. The other nodes in the cluster are not confused at all; they know that this node is "dead," and that they should ignore it. But there is no way to know this by looking at the "failed" node.

This points back to the design point that Slony-I is not a network monitoring tool. You need to have clear methods of communicating to applications and users what database hosts are to be used. If those methods are lacking, adding replication to the mix will worsen the potential for confusion, and failover will be a point at which there is enormous potential for confusion.

If the database is very large, it may take many hours to recover node1 as a functioning Slony-I node; that is another reason to consider failover as an undesirable "final resort."

3.4.8. Planning for Failover

Failover policies should be planned for ahead of time.

Most pointedly, any node that is expected to be a failover target must have its subscription(s) set up with the option FORWARD = YES. Otherwise, that node is not a candidate for being promoted to origin node.
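As an illustration, a subscription that keeps node 2 eligible for promotion could be created as sketched here. The cluster name, conninfo strings, and set id are assumptions for the two-node example used earlier in this section.

```slonik
# Hypothetical preamble: adjust the cluster name and conninfo strings to your setup.
cluster name = mycluster;
node 1 admin conninfo = 'dbname=mydb host=node1 user=slony';
node 2 admin conninfo = 'dbname=mydb host=node2 user=slony';

# Subscribe node 2 to set 1 from node 1. With forward = yes, node 2 keeps
# the replication log data it would need to act as a provider or new origin.
subscribe set (id = 1, provider = 1, receiver = 2, forward = yes);
```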

This may simply involve thinking about what the priority lists should be of what should fail to what, as opposed to trying to automate it. But knowing what to do ahead of time cuts down on the number of mistakes made.

At Afilias, a variety of internal [The 3AM Unhappy DBA's Guide to...] guides have been created to provide checklists of what to do when certain "unhappy" events take place. This sort of material is highly specific to the environment and the set of applications running there, so you would need to generate your own such documents. This is one of the vital components of any disaster recovery preparations.