Christopher Browne cbbrowne at ca.afilias.info
Tue Feb 8 09:31:38 PST 2011
Jan Wieck <JanWieck at Yahoo.com> writes:

> Problem
>
> The current steps necessary to perform failover in a Slony replication 
> cluster where multiple nodes have failed are complicated. Even a single 
> node failover with multiple surviving subscribers can lead to unexpected 
> failures and broken configurations. Many DBAs do not have the detailed 
> knowledge of Slony-I internals that is needed to write slonik scripts 
> that do the necessary waiting between critical steps in the failover 
> procedure.
>
>
> Proposal summary
>
> To solve this problem we will develop a slonik failover command that 
> receives, all at once, the information about which nodes have failed 
> and which sets should be taken over by which backup node. This command 
> will perform sanity checks on the remaining nodes, then execute all the 
> necessary steps, including waiting for event propagation, to reach a 
> new, working cluster shape in which the failed nodes are dropped from 
> the configuration of the remaining cluster.
>
>
> Command syntax
>
> The new FAILOVER CLUSTER command will specify the list of failed nodes. 
> This list is a string of space-separated node-ID numbers. Following 
> this are the actions to be taken for every set that originates on one of 
> the failed nodes. There are two possible actions:
>
>     * Specify a backup node that will be the new origin. This backup
>       node must be a forwarding subscriber of the set and one of the
>       surviving nodes.
>
>     * Abandon the set. This can happen when there is not a single
>       forwarding subscriber of the set left.

So, the FAILOVER CLUSTER command will specify one of these actions for
each affected node/set combination?
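
Just to make sure I'm reading that right, I'd picture an invocation
looking something like this (the syntax here is purely hypothetical,
since the command doesn't exist yet, and the node/set IDs are made up):

    failover cluster (
        failed nodes = '2 4',
        set (id = 1, backup node = 3),
        set (id = 2, backup node = 3),
        set (id = 3, abandon)
    );

That is, one "failed nodes" list, plus one clause per set that
originates on a failed node.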

To clarify the "abandon" option...  I don't imagine we'd want to have
the command "automatically abandon" a set if it wasn't expressly asked
to abandon it.

If it is concluded that a set needs to be abandoned, but slonik hasn't
been expressly asked to do so, an error should be raised and the
failover attempt aborted until the administrator addresses the
problem.  Right?

> Stage 1 - Requirements and sanity checks
>
> For the FAILOVER CLUSTER command it is necessary that the remaining 
> nodes have a sufficient path network.
>
>     * For every surviving set, the surviving subscribers must have a path
>       to either the new origin (backup node), or another surviving
>       forwarder for that set.
>
>     * Nodes that are currently not subscribed to any set at all must have
>       paths that somehow allow a functional listen network to be
>       generated.
>
>     * The slonik script that performs the FAILOVER CLUSTER command must
>       specify admin conninfos for all remaining nodes, since slonik will
>       need to connect to all of them.

Error checking comments...

If any of these problems are discovered, the slonik script should, I
think, do the following:

1.  Report all of the missing configuration, and
2.  Refuse to run (i.e. - it doesn't try to do part of the work)
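
For the path-network check, I'm imagining slonik running something
along these lines against each surviving node (a rough sketch only; it
assumes the usual sl_subscribe/sl_path catalogs, a cluster schema named
"_cluster", failed nodes 2 and 4, and set 1):

    -- Surviving subscribers of set 1 that have no path to any
    -- surviving forwarder of that set.  (The backup node, being a
    -- required forwarding subscriber, is covered by this check.)
    SELECT s.sub_receiver
      FROM _cluster.sl_subscribe s
     WHERE s.sub_set = 1
       AND s.sub_receiver NOT IN (2, 4)
       AND NOT EXISTS (
           SELECT 1
             FROM _cluster.sl_path p
             JOIN _cluster.sl_subscribe f
               ON f.sub_receiver = p.pa_server
              AND f.sub_set = s.sub_set
              AND f.sub_forward
            WHERE p.pa_client = s.sub_receiver
              AND p.pa_server NOT IN (2, 4));

Any rows coming back from that would be grounds for refusing to run.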

> Stage 2 - Disabling slon processes
>
> In order for slonik to perform failover procedures without concurrently 
> running slon processes interfering with it via event processing, we need 
> a way to tell the local node's slon process not to start up normally but 
> to stop and retry before spawning off all the worker threads. This 
> should be a column inside the sl_node table, possibly the existing 
> no_active column, which seems unused. Stage 2 sets this flag on all 
> surviving nodes and restarts the slon processes via NOTIFY.

If this is the intended use of sl_node.no_active, then that's fine; if
this really is a new purpose, then I don't see any problem in adding a
new column for it.

I'd rather add a new, well-defined column than try to reuse something.
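
Either way, on each surviving node stage 2 presumably boils down to
something like this (a sketch; "no_failover_pending" is a made-up name
for the new column, and I'm assuming the existing Restart notification
that slonik's RESTART NODE already uses):

    -- Flag the local node so its slon stops short of spawning workers:
    UPDATE _cluster.sl_node
       SET no_failover_pending = true
     WHERE no_id = _cluster.getlocalnodeid('_cluster');
    -- Kick the slon so it restarts and sees the flag:
    NOTIFY "_cluster_Restart";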

> Stage 3 - Disabling paths to failed nodes
>
> Failures are not necessarily caused by DB server problems, but are often 
> the result of network problems.
>
> In any failover case, we must not merely assume that the failed nodes 
> are no longer reachable. It is actually best practice to forcibly make 
> failed nodes unreachable by network management means. Slony cannot 
> prevent any outside applications from still accessing those "failed" 
> nodes. But we can make sure that the surviving nodes, as defined by the 
> FAILOVER CLUSTER command, will not receive any more events from those 
> nodes that may possibly interfere with the failover procedures. We 
> accomplish this by updating all sl_path entries to/from any of the 
> failed nodes on all remaining nodes to something that does not 
> represent a valid conninfo string. This way, the remaining nodes will no 
> longer be able to connect and thus no longer receive any events 
> (outstanding or newly created).

I'd rather that this be pretty explicit.  

I'd add a boolean to sl_path which indicates that the path shouldn't be
used.

But I don't object to adding an additional "belt & suspenders" component
to sl_path.pa_conninfo which we know will cause erroneous attempts at
connection to fail.  (I kind of like the idea of prepending the conninfo
with "node=dead ".)

> Stage 4 - Remove abandoned sets from the configuration
>
> All sets that have been specified to be abandoned will be removed from 
> the configuration of all surviving nodes. After doing this, slonik 
> will determine which surviving node subscribed to the set was furthest 
> advanced, in order to inform the administrator which node has the most 
> advanced data.

Do we need to have some standard place to put "inform the administrator"
material?

It may be OK to say "slonik output", though I also like the idea of
having slon log this as "SLON_CONFIG" output.
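
As for computing "most advanced", I'd guess something like the
following is sufficient (a sketch; it assumes the usual sl_confirm
catalog, failed origin 2, and treats the highest confirmed event
sequence from the failed origin as the measure of progress):

    -- Which surviving node confirmed the most events from origin 2?
    SELECT con_received AS node_id, max(con_seqno) AS last_confirmed
      FROM _cluster.sl_confirm
     WHERE con_origin = 2
       AND con_received NOT IN (2, 4)
     GROUP BY con_received
     ORDER BY 2 DESC
     LIMIT 1;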

> Comments?

The examples need some more thought, but it's certainly good to have
some reasonably sophisticated examples of how things may break, and
predicted responses.
-- 
(reverse (concatenate 'string "ofni.sailifa.ac" "@" "enworbbc"))
Christopher Browne
"Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"

