Steve Singer ssinger at ca.afilias.info
Mon Oct 15 19:49:44 PDT 2012
On 12-10-15 09:55 PM, Joe Conway wrote:
> I have a client which is seeing something just like:
>    http://www.slony.info/bugzilla/show_bug.cgi?id=130
> which is a duplicate of
>    http://www.slony.info/bugzilla/show_bug.cgi?id=80
> The latter apparently was never fixed.
>
> The comments in the bug say:
>
<snip>

>
> Here's what we do in a nutshell:
> -----------------------
> A == original master
> B == slave1
> C == new master
> D == slave2
>
> all commands run from C
>
> * switchover from A to B
> * clone A to make C
> * switchback from B to A

Do you make sure that all nodes have confirmed the switchback before 
proceeding to the failover below? If not, it would be better if you did.
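
For example, a slonik fragment along these lines would block until every 
node has confirmed the switchback (a sketch only; the node numbering, 
with A = 1 as the restored origin, and the 300-second timeout are 
assumptions):

```
# generate a fresh event on the restored origin (node 1 = A, assumed)
sync (id = 1);
# wait until every other node has confirmed events from node 1
wait for event (
    origin = 1,
    confirmed = all,
    wait on = 1,
    timeout = 300
);
```

WAIT FOR EVENT with CONFIRMED = ALL ensures the MOVE SET and the 
subsequent SYNC have propagated everywhere before the failover is 
attempted.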


> * failover from A to C
> * drop A
> -----------------------
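
In slonik terms the sequence above looks roughly like the following 
(a sketch only: the node IDs A = 1, B = 2, C = 3 and set id 1 are 
assumptions, the cluster name / admin conninfo preamble is omitted, and 
the physical copy of A's database into C has to happen between CLONE 
PREPARE and CLONE FINISH):

```
# switchover from A to B
lock set (id = 1, origin = 1);
move set (id = 1, old origin = 1, new origin = 2);

# clone A to make C (restore A's dump into C between these two steps)
clone prepare (id = 3, provider = 1);
clone finish (id = 3, provider = 1);

# switchback from B to A
lock set (id = 1, origin = 2);
move set (id = 1, old origin = 2, new origin = 1);

# failover from A to C, then drop A
failover (id = 1, backup node = 3);
drop node (id = 1, event node = 3);
```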
>
> This works fine 90% of the time (using some scripts to ensure we are
> doing it exactly the same each time).
>
> When we do the failover (which is run on/from C), slonik completes the
> failover "successfully" (at least no errors reported by slonik), but
> hours later (i.e. it is not a matter of not waiting long enough I think)
> the original master is still the set_origin in the slony catalog of the
> new master (this is on a test cluster with no activity). Consequently
> when we try to drop the old master it fails (which is probably a good
> thing since the failover was not really successful).
>
>   sl_path looks correct
>   sl_subscribe has an extra row marked active=false with
>     B as the provider (leftover from the switchback?)

Exactly which version of Slony are you using? And I don't suppose this 
is bug http://www.slony.info/bugzilla/show_bug.cgi?id=260 by any chance?
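
Either way, the catalog state is easy to check directly; something like 
this shows the set origin and any stale subscriptions (the schema name 
`_testcluster` is a placeholder — substitute `_` followed by your actual 
cluster name):

```sql
-- which node does each set claim as its origin?
SELECT set_id, set_origin FROM _testcluster.sl_set;

-- any leftover or inactive subscription rows?
SELECT sub_set, sub_provider, sub_receiver, sub_active
FROM _testcluster.sl_subscribe
ORDER BY sub_set, sub_receiver;
```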

>   sl_set still has set_origin pointing to A
>   sl_node still shows all 4 nodes as active=true
>
> So questions:
> 1) Is bug 80 still open?
> 2) Any plan to fix it or even ideas how to fix it?

I substantially rewrote the failover logic for 2.2 (grab master from 
git). One of the big things holding up a 2.2 release is that it needs 
testing by people other than me, to verify that I haven't missed 
something obvious and that the new behaviours are sane.

A FAILOVER in 2.2 no longer involves that 'faked event' from the old 
origin. The changes in 2.2 also allow you to specify multiple failed 
nodes as arguments to the FAILOVER command; the hope is that this 
addresses the issues Jan alludes to with multiple failed nodes.
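
For reference, the 2.2 FAILOVER syntax takes a list of failed-node / 
backup-node pairs, along these lines (the node IDs are assumptions, with 
node 3 as the surviving backup):

```
failover (
    node = (id = 1, backup node = 3),
    node = (id = 2, backup node = 3)
);
```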



> 3) Anything obvious we are missing?
> 4) Is there a better/more reliable way to get C stood
>     up as the new master without taking down the cluster
>     longer than the sequence above would do?
>



> Thanks,
>
> Joe
>



More information about the Slony1-hackers mailing list