Tue Feb 8 07:07:54 PST 2011
On 11-02-07 03:00 PM, Jan Wieck wrote:
>
> Stage 2 - Disabling slon processes
>
> In order for slonik to perform failover procedures without concurrently
> running slon processes interfering with it via event processing, we need
> a way to tell the local node's slon process to not start up normally but
> to stop and retry before spawning off all the worker threads.

I think you mean stop and wait, not retry. From what I read below, the
slon will wait until stage 6. There might be some polling/looping
mechanism used to achieve this, but I don't see slon restarting or
logging any errors while it is in this state.

> This should be a column inside the sl_node table, possibly the existing
> no_active column, which seems unused. Stage 2 sets this flag on all
> surviving nodes and restarts the slon processes via NOTIFY.

no_active appears to be used in RebuildListenEntries(). I see at least
three states a node can be in:

* A normal, active node.
* A node that is paused as described above but is still considered part
  of the listen network.
* A node that has a row in sl_node but isn't yet considered part of the
  listen network (these are the cases where no_active = false today).
  I think a node is probably in this state after a CLONE PREPARE (but
  before a CLONE FINISH), and also when we have failed the node but
  haven't yet dropped it.

>
> Stage 3 - Disabling paths to failed nodes
>
> Failures are not necessarily caused by DB server problems, but are often
> the result of network problems.
>
> In any failover case, we must not only assume that the failed nodes are
> no longer reachable. It is actually best practice to forcibly make
> failed nodes unreachable by network management means. Slony cannot
> prevent any outside applications from still accessing those "failed"
> nodes. But we can make sure that the surviving nodes, as defined by the
> FAILOVER CLUSTER command, will not receive any more events from those
> nodes that may possibly interfere with the failover procedures. We
> accomplish this by updating all sl_path entries to/from any of the
> failed nodes on all remaining nodes to something that does not
> represent a valid conninfo string. This way, the remaining nodes will no
> longer be able to connect and thus no longer receive any events
> (outstanding or newly created).
>
>
> Stage 4 - Remove abandoned sets from the configuration
>
> All sets that have been specified to be abandoned will be removed from
> the configuration of all surviving nodes. After doing this, slonik will
> analyze which was the most advanced surviving node subscribed to the
> set, in order to inform the administrator which node has the most
> advanced data.
>
>
> Stage 5 - Reshaping the cluster
>
> This is the most complicated step in the failover procedure. Consider
> the following cluster configuration:
>
>     A ----- C ----- D
>     |       |
>     |       |
>     B      /
>     |     /
>     |    /
>     E --/
>
> Node A is origin to set 1 with subscribers B, C, D and E
> Node C is origin to set 2 with subscribers A, B, D and E
> Node E receives both sets via node B
>
> Failure case is that nodes A and B fail and C is the designated backup
> node for set 1.
>
> Although node E has a path to node C, which could have been created
> after the failure prior to executing FAILOVER CLUSTER, it will not use
> it to listen for events from node C. Its subscription for set 2,
> originating from node C, uses node B as data provider. This causes node
> E to listen for events from node C via B, which is now a disabled path.
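Going back to stage 3 for a second, just so I'm sure I'm reading it
right: I picture the path invalidation as something slonik runs on every
surviving node, roughly like the sketch below. This is only an
illustration, not the actual implementation; the failed node ids 1 and
2, the _mycluster schema name, and the '<node failed>' marker are all
placeholders of mine, and the sl_path column names are from memory.

    -- Sketch only: invalidate every path to or from the failed nodes
    -- (here assumed to be node ids 1 and 2) on one surviving node.
    UPDATE _mycluster.sl_path
       SET pa_conninfo = '<node failed>'  -- anything that is not a valid conninfo
     WHERE pa_server IN (1, 2)
        OR pa_client IN (1, 2);

If that's the idea, the surviving nodes will simply fail to connect to
the failed nodes from then on, which I think is what you intend.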
>
> Stage 5.1 - Determining the temporary origins
>
> Per set, slonik will find which is the most advanced forwarding
> subscriber. It will make that node the temporary origin of the set. In
> the above example, this can either be node C or E. If there are multiple
> nodes qualifying for this definition, the designated backup node is
> chosen. There will only be a temporary origin in this test case if node
> E is the most advanced. In that case, there will be, as per today, a
> FAILOVER_SET event, faked to originate from node A, injected into node
> E, followed by a MOVE_SET event to transfer the ownership to node C. If
> node C, the designated backup node, is more advanced, only a
> FAILOVER_SET originating from A will be injected into node C without a
> following MOVE_SET.
>
> Note that at this point, the slon processes are still disabled so none
> of the events are propagating yet.

We also need to deal with non-SYNC events that have escaped the failed
node but not yet made it to the temporary origin. For example: you have
replication set 1 with 3 nodes, node 1 being the origin, plus a fourth
node that is NOT subscribed to the set but has a direct path to node 1.

    2<---1---->3
         |
         4

The sl_event on node 1 is:

    1,1233  SYNC
    1,1234  SYNC
    1,1235  DROP SET 1
    1,1236  SYNC

At the time node 1 fails:

    node 2 is at 1,1234
    node 3 is at 1,1233
    node 4 has processed 1,1235

Slonik must examine all the nodes in the cluster (not just the
subscribers) and see that the correct sequence number for the faked
FAILOVER_SET event is 1,1237, NOT 1,1235. Furthermore, I think that
slonik needs to wait until node 2 has processed event 1,1235 before
proceeding with step 5.1.

Don't get caught up on the DROP SET versus any other non-SYNC event (it
could also be a CREATE SET, STORE NODE, EXECUTE SCRIPT, etc.). The
important thing is that it is an event that escaped the failed node to a
non-subscriber (that node might subscribe to another set). The
non-subscriber will have a record of the event in sl_event, since we
store non-SYNC events from remote nodes in sl_event on all nodes.

>
> Stage 5.2 - Reshaping subscriptions
>
> Per set, slonik will
>
> * resubscribe every subscriber of that set to the (temporary) origin

I think you mean resubscribe every direct subscriber of the old origin
to the temporary origin (you deal with non-direct subscribers below).

> * resubscribe every subscriber of that set which does not have a
>   direct path to the (temporary) origin, to an already resubscribed
>   forwarder. This is an iterative process. It is possible that slonik
>   finds a non-forwarding subscriber that is more advanced than the
>   temporary origin or backup node. This node's subscription must
>   forcibly be dropped with a WARNING, because it has processed changes
>   the other nodes don't have but for which the sl_log entries have
>   been lost in the failure.
>
> Stage 5.3 - Sanitizing sl_event
>
> It is possible that there were SUBSCRIBE_SET events outstanding when
> the failure occurred. Furthermore, the data provider for that event
> may be one of the failed nodes. Since the slon processes are stopped at
> this time, the new subscriber will not be able to perform the
> subscription at all, so these events must be purged from sl_event on all
> nodes and WARNING messages given.

Is SUBSCRIBE_SET the only event that must be canceled, or are there
others? (MOVE_SET? FAILOVER_SET?)
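To make the escaped-event check under stage 5.1 concrete, what I have in
mind is roughly the query below, run against every surviving node
(subscriber or not), taking the maximum over all the answers. Again only
a sketch; the _mycluster schema name and failed node id 1 are
placeholders, and the sl_event column names are from memory.

    -- Sketch only: the highest event sequence number from failed
    -- origin 1 that this particular node has ever seen.
    SELECT COALESCE(MAX(ev_seqno), 0) AS highest_seen
      FROM _mycluster.sl_event
     WHERE ev_origin = 1;

The faked FAILOVER_SET would then get a sequence number higher than the
largest value returned by any surviving node.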
>
> Stage 6 - Enable slon processes and wait
>
> The slon processes will now start processing all the outstanding events
> that were generated or still queued up, until all the faked FAILOVER_SET
> events have been confirmed by all other remaining nodes. This may take
> some time in case there were other subscriptions in progress when the
> failure happened. Those subscriptions had been interrupted when we
> disabled the slon processes in stage 2 and have just been restarted.
>
>
> Stage 7 - Sanity check and drop failed nodes
>
> After having waited for stage 6 to complete, none of the surviving nodes
> should list any of the failed nodes as the origin of any set, and all
> events originating from one of the failed nodes should be confirmed by
> all surviving nodes. If those conditions are met, it is safe to initiate
> DROP NODE events for all failed nodes.
>
>
> Examining more failure scenarios
>
> Case 2 - 5 node cluster with 2 sets.
>
>     A ----- C ----- E
>     |       |
>     |       |
>     B       D
>
> Node A is origin of set 1, subscribed to nodes B, C, D and E.
> Node B is origin of set 2, subscribed to nodes A, C, D and E.
>
> Failure case is nodes A and B fail, node C is designated backup for set
> 1, node D is designated backup for set 2.
>
> * The remaining cluster has a sufficient path network to pass stage 1
> * Stage 2 cannot fail by definition
> * Stage 3 cannot fail by definition
> * Stage 4 is irrelevant for this example
> * Stage 5 will decide that C is the new origin of set 1 via
>   FAILOVER_SET and that node C will be the temporary origin for set 2
>   via FAILOVER_SET followed by an immediate MOVE_SET to D.
>   All subscriptions for E and D will point to C as data provider.
> * Stage 6 will wait until both the FAILOVER_SET and the MOVE_SET event
>   from stage 5 have been confirmed by C, D and E.
> * Stage 7 should pass its sanity checks and drop nodes A and B.
>
>
> Case 3 - 5 node cluster with 3 sets.
>
>     A ----- C ----- E
>     |       |
>     |       |
>     B ----- D
>
> Node A is origin of set 1, subscribed to nodes B, C, D and E. Node D
> uses node C as data provider.
> Node B is origin of set 2, subscribed to nodes A and C. Node D is
> currently subscribing to set 2 using the origin node B as provider.
> Node B is origin of set 3, subscribed to node A only.
>
> Failure case is nodes A and B fail, node C is designated backup for
> sets 1 and 2.
>
> * The remaining cluster has a sufficient path network to pass stage 1
> * Stage 2 cannot fail by definition
> * Stage 3 cannot fail by definition
> * Stage 4 will remove set 3 from sl_set on all nodes.
> * Stage 5 will decide that C is the new origin of sets 1 and 2 via
>   FAILOVER_SET. Subscriptions for set 1 will point to node C.
>   It is quite possible that the SUBSCRIBE_SET event for node D
>   subscribing to set 2 from B is interrupted by the failover but,
>   since it causes very little work, had already been processed and
>   confirmed by nodes C and E. Stage 5.3 will delete these events from
>   sl_event on nodes C and E. The event would fail when D receives it
>   from E because it cannot connect to the data provider B.
> * Stage 6 will wait until both FAILOVER_SET events from stage 5 have
>   been confirmed by C, D and E.
> * Stage 7 should pass its sanity checks and drop nodes A and B.
>
>
> Comments?
> Jan
>

A few other notes:

* This should be done in a way such that if slonik dies halfway through
  the process (it is killed by a user, the OS crashes, etc.) then you
  have some hope of recovering.
* Some examples of the proposed syntax would be useful to debate.
If anyone wants to engage in bike-shedding, we should do it before you
spend time modifying parser.y.
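One last thought on stage 7: the two sanity checks described there look
like they boil down to a couple of queries per surviving node, something
like the sketch below. As before this is only an illustration; the
failed node ids 1 and 2, the surviving node ids 3, 4 and 5, and the
_mycluster schema name are placeholders, and the catalog columns are
from memory.

    -- Sketch only: no surviving node may still list a failed node as a
    -- set origin.
    SELECT set_id
      FROM _mycluster.sl_set
     WHERE set_origin IN (1, 2);                -- must return no rows

    -- Sketch only: every event from a failed origin must be confirmed
    -- by every surviving node.
    SELECT e.ev_origin, e.ev_seqno
      FROM _mycluster.sl_event e,
           (VALUES (3), (4), (5)) AS survivors (no_id)
     WHERE e.ev_origin IN (1, 2)
       AND NOT EXISTS (SELECT 1
                         FROM _mycluster.sl_confirm c
                        WHERE c.con_origin   = e.ev_origin
                          AND c.con_received = survivors.no_id
                          AND c.con_seqno   >= e.ev_seqno);  -- must return no rows

If either query returns rows on any node, slonik should presumably
refuse to issue the DROP NODE events and report what it found.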