Tue Oct 16 05:50:26 PDT 2012
- Previous message: [Slony1-hackers] Failover never completes
- Next message: [Slony1-hackers] Failover never completes
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 12-10-15 11:20 PM, Joe Conway wrote:
> On 10/15/2012 07:49 PM, Steve Singer wrote:
>>> all commands run from C
>>>
>>> * switchover from A to B
>>> * clone A to make C
>>> * switchback from B to A
>>
>> Do you make sure that all nodes have confirmed the switchback before
>> proceeding to the failover below? If not, it would be better if you did.
>
> Yes -- in fact we wait for confirmation, and then do a sync on each node
> and wait for confirmation of those as well.
>
>>> sl_path looks correct
>>> sl_subscribe has an extra row marked active=false with
>>> B as the provider (leftover from the switchback?)
>>
>> Exactly which version of slony are you using? I assume this isn't bug
>> http://www.slony.info/bugzilla/show_bug.cgi?id=260 by any chance?
>
> We are using 2.1.0. We tried upgrading to 2.1.2 but got stuck because we
> cannot have a mixed 2.1.0/2.1.2 cluster. We have constraints that do not
> allow for upgrade-in-place of existing nodes, which is why we want to
> add a new node and failover to it (to facilitate upgrades of components
> other than slony, e.g. postgres itself).

So you're:

1. Adding a new node
2. Stopping the old node
3. Running UPDATE FUNCTIONS on the new node
4. Starting up the new slon and running FAILOVER?

> I guess if you think this bug is our problem we can set up an entirely
> 2.1.2 test environment, but it will be painful, and not solve all our
> problems as we have some 2.1.0 clusters that we eventually need to upgrade.
>
> Is bug 260 issue #2 deterministic or a race condition? Our current
> process works 9 out of 10 times...

My recollection is that #260 tended to happen more often than not, but there
are a number of other rare race conditions I have occasionally hit, which
led to the failover changes in 2.2.

Does your sl_listen table have any cycles in it, i.e. a-->b, b-->a (or even
cycles through a third node)?

Which nodes have processed the FAILOVER_SET event? Which (if any) nodes have
processed the ACCEPT_SET? Which node is the 'most ahead' node? (I think
slonik reports this on stdout when it runs.) Are the remoteWorkerThread_A
threads running on the other nodes, and what are they doing?

I'm asking these questions to try to get a sense of what the cluster state
is and where the problem might be (some example queries for checking this
are sketched after this message).

> FWIW we only have one set so I don't think issue #1 applies.
>
>>> sl_set still has set_origin pointing to A
>>> sl_node still shows all 4 nodes as active=true
>>>
>>> So questions:
>>> 1) Is bug 80 still open?
>>> 2) Any plan to fix it or even ideas how to fix it?
>>
>> I substantially rewrote a lot of the failover logic for 2.2 (grab master
>> from git). One of the big things holding up a 2.2 release is that it
>> needs people other than myself to test it to verify that I haven't
>> missed something obvious and that the new behaviours are sane.
>>
>> A FAILOVER in 2.2 no longer involves that 'faked event' from the old
>> origin. The changes in 2.2 also allow you to specify multiple failed
>> nodes as arguments to the FAILOVER command. The hope is that it
>> addresses the issues Jan alludes to with multiple failed nodes.
>
> Interesting, but even more difficult to test in our environment for
> reasons I cannot really go into on a public list.
>
> Thanks for the reply.
>
> Joe
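For illustration, queries along the following lines could answer some of
the questions above. This is only a sketch, not an official Slony recipe:
"_mycluster" is a placeholder for the cluster's actual _<clustername>
schema, and node A is assumed to have node id 1 -- substitute the real
values before running anything.

-- Direct cycles in sl_listen: two nodes each listening for the same
-- origin's events via the other, so neither ever receives them first.
-- (Only catches two-node cycles; longer cycles need a recursive check.)
SELECT a.li_origin, a.li_provider, a.li_receiver
  FROM _mycluster.sl_listen a
  JOIN _mycluster.sl_listen b
    ON a.li_origin   = b.li_origin
   AND a.li_provider = b.li_receiver
   AND a.li_receiver = b.li_provider;

-- Run on each node: has this node seen the FAILOVER_SET / ACCEPT_SET
-- events yet?
SELECT ev_origin, ev_seqno, ev_type
  FROM _mycluster.sl_event
 WHERE ev_type IN ('FAILOVER_SET', 'ACCEPT_SET')
 ORDER BY ev_origin, ev_seqno;

-- Highest event from the old origin (node A, assumed id 1) that each
-- node has confirmed.
SELECT con_received, max(con_seqno) AS last_confirmed_seqno
  FROM _mycluster.sl_confirm
 WHERE con_origin = 1
 GROUP BY con_received
 ORDER BY con_received;

Comparing the last query's output across nodes shows how far each node has
confirmed the old origin's events, which is essentially the 'most ahead'
question above.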