Tue Dec 13 06:32:59 PST 2011
- Previous message: [Slony1-general] Slony for update
- Next message: [Slony1-general] Proposed Failover changes for 2.2
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I've been working on improving the reliability of the FAILOVER command when multiple nodes fail at the same time. The changes I've had to make have made the failover time logic very complicated. An alternative is to restrict the cluster configurations that can be used with the failover command.

I am proposing something along the lines of:

If you have an origin node that you want to fail over, then the failover command would take a list of the failed nodes. It would then look for backup nodes that meet the following criteria:

* The backup node for a given origin is a subscribed forwarder to ALL sets that the failed node is an origin for.
* The backup node has bi-directional paths to all nodes that the failed origin has paths to.
* Any other nodes that are not being fed from one of the potential failover targets will be dropped.

Out of the nodes that meet the above criteria, the failover command would then pick the most ahead node and make that the new origin for the sets from the failed node.

After the failover command finishes you could then use MOVE SET and SUBSCRIBE SET to reshape the cluster as you please.

How would this work:

Example 1:

1---->2

FAILOVER ( node=(id=1));

would fail node 1 over to node 2.

Example 2:

1---->2---->4
|           .
\----3......

(3 and 4 are connected with a PATH but have no subscription using it)

FAILOVER(node=(id=1));

would result in a message such as 'node 1 has no failover targets', because node 1 has paths to both 2 and 3, but no other node has paths to both nodes 2 and 3.

Example 3:

1---->2---->4
|     .
\-----3

FAILOVER(node=(id=1))

slonik would pick one of 2 or 3 and fail over to it. It would pick the one that is most ahead.

Example 4:

1---->2---->4 | v 3

FAILOVER (node=(id=1), node=(id=2));

Results in 'node 1 has no failover targets'. The above cluster can't survive both node 1 and node 2 failing at the same time.

Example 5:

1(set1)----->2(set1)----->4(set1) | (set2) . | . V ............... 3 (set1,set2) | | 5(set2)

Node 3 is the only acceptable failover target. Node 4 would be unsubscribed or dropped.

Example 6:

|<--------------->4 (set2) 1(set1)------>2--\ | V . 7 3........ | 5 6

In this example node 4 is the origin for set 2; it replicates to node 1, which is the origin for set 1. Nodes 2 and 3 then receive sets 1 and 2 from node 1. Node 4 is a subscriber for set 1.

FAILOVER( node=(id=1))

would give node 2 or 3 as a failover target. Node 4 would be unsubscribed/dropped from set 1. It is possible that set 2 would need to be dropped from all nodes.

I realize that this means some existing clusters will no longer work with failover, but I doubt that the existing failover code works 100% of the time for clusters of that type of configuration anyway.

I also think it is safer for slonik to make the most ahead node the new master and then let you reshape the cluster with MOVE SET. Today, if additional things go wrong in the middle of a FAILOVER procedure, it can be very difficult to recover the cluster. I feel that if we just promote the most ahead node to the new master, things will be safer.

I am proposing this change for 2.2. Do any users object to this type of change? Is anyone using Slony for failover building non-standard cluster configurations?
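
The candidate-selection rules proposed in this message can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not slonik's implementation: the `Cluster` class, its fields, the helper names, and the `last_event` counter used to decide which node is "most ahead" are all hypothetical. The sketch also assumes that the candidate itself is skipped when checking the path requirement, and it does not model the dropping of nodes that are not fed from a failover target.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # All fields are hypothetical stand-ins for Slony's configuration tables.
    origins: dict      # set id -> id of the set's origin node
    forwarders: dict   # node id -> set ids it subscribes to with forwarding on
    paths: set         # directed (from_node, to_node) pairs
    last_event: dict   # node id -> newest event replicated from the failed origin


def bidirectional(*pairs):
    """Expand undirected pairs into directed paths in both directions."""
    directed = set()
    for a, b in pairs:
        directed.add((a, b))
        directed.add((b, a))
    return directed


def failover_candidates(cluster, failed_origin, failed_nodes):
    """Return the sorted ids of nodes that meet the first two criteria."""
    # Sets whose origin is the failed node.
    needed_sets = {s for s, o in cluster.origins.items() if o == failed_origin}
    # Nodes the failed origin has a path to.
    neighbours = {b for (a, b) in cluster.paths if a == failed_origin}

    candidates = []
    for node, sets in cluster.forwarders.items():
        if node in failed_nodes:
            continue
        # Criterion 1: forwarding subscriber to ALL sets of the failed origin.
        if not needed_sets <= sets:
            continue
        # Criterion 2: bi-directional paths to every node the failed origin
        # has paths to (assumption: the candidate itself is skipped).
        if all((node, other) in cluster.paths and (other, node) in cluster.paths
               for other in neighbours if other != node):
            candidates.append(node)
    return sorted(candidates)


def pick_new_origin(cluster, failed_origin, failed_nodes):
    """Pick the most ahead candidate, or None if there is no failover target."""
    candidates = failover_candidates(cluster, failed_origin, failed_nodes)
    if not candidates:
        return None
    return max(candidates, key=lambda n: cluster.last_event[n])
```

Under these assumptions, a cluster shaped like Example 2 (nodes 2 and 3 share no path) yields no candidate, while one shaped like Example 3 yields nodes 2 and 3, from which the one with the larger `last_event` value is chosen.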