Christopher Browne cbbrowne
Tue Aug 23 16:40:49 PDT 2005
elein wrote:

>Slony 1.1.  Three nodes. 10 set(1) => 20 => 30.
>
>I ran failover from node10 to node20.
>
>On node30, the origin of the set was changed 
>from 10 to 20; however, drop node10 failed
>because of the row in sl_setsync.
>
>This causes slon on node30 to quit and the cluster to 
>become unstable, which in turn prevents putting
>node10 back into the mix.
>
>Please tell me I'm not the first one to run into
>this...
>
>The only clean workaround I can see is to drop
>node 30, re-add it, and then re-add node10.  This
>leaves us without a backup for the downtime.
>
>
>This is what is in some of the tables for node20:
>
>gb2=# select * from sl_node;
> no_id | no_active |       no_comment        | no_spool
>-------+-----------+-------------------------+----------
>    20 | t         | Node 20 - gb2@localhost | f
>    30 | t         | Node 30 - gb3@localhost | f
>(2 rows)
>
>gb2=# select * from sl_set;
> set_id | set_origin | set_locked |     set_comment
>--------+------------+------------+----------------------
>      1 |         20 |            | Set 1 for gb_cluster
>(1 row)
>
>gb2=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>-----------+------------+-----------+------------+------------+---------+-----------------
>(0 rows)
>
>This is what I have for node30:
>
>gb3=# select * from sl_node;
> no_id | no_active |       no_comment        | no_spool
>-------+-----------+-------------------------+----------
>    10 | t         | Node 10 - gb@localhost  | f
>    20 | t         | Node 20 - gb2@localhost | f
>    30 | t         | Node 30 - gb3@localhost | f
>(3 rows)
>
>gb3=# select * from sl_set;
> set_id | set_origin | set_locked |     set_comment
>--------+------------+------------+----------------------
>      1 |         20 |            | Set 1 for gb_cluster
>(1 row)
>
>gb3=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>-----------+------------+-----------+------------+------------+---------+-----------------
>         1 |         10 |       235 | 1290260    | 1290261    |         |
>(1 row)
>
>frustrated,
>--elein
>_______________________________________________
>Slony1-general mailing list
>Slony1-general@gborg.postgresql.org
>http://gborg.postgresql.org/mailman/listinfo/slony1-general
>  
>
That error message in your other email was VERY helpful in pointing out at
least what clues to look for...

I /think/ that the FAILOVER_SET event hasn't yet been processed on node
30, which would be consistent with everything we see.

Can you check logs on node 30 or sl_event on node 30 to see if
FAILOVER_SET has made it there?
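
Something like this ought to show whether it has arrived (assuming the
cluster schema - presumably _gb_cluster, going by the set comment - is in
your search_path, as in the queries above):

  gb3=# select ev_origin, ev_seqno, ev_type
          from sl_event
         where ev_type = 'FAILOVER_SET';

If nothing comes back, the event presumably hasn't reached node 30 yet; if
a row is there but sl_setsync still looks as above, it arrived but hasn't
been processed.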

What seems not to have happened is for the FAILOVER_SET event to process
on node 30; when that *does* happen, it would delete the sl_setsync
entry pointing to node 10 and create a new one pointing to node 20.
(This is in the last half of the function failoverSet_int().)
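
Roughly speaking - and this is just an illustration of the effect, not the
actual code in failoverSet_int() - the sl_setsync part of that step on node
30 amounts to something like:

  delete from sl_setsync where ssy_setid = 1;
  -- ...followed by inserting a fresh sl_setsync row for set 1 with
  -- ssy_origin = 20, based on the new origin's current sync position

So until the event runs there, node 30 keeps the stale ssy_origin = 10 row
that dropping node 10 trips over.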

I'll bet that sl_subscribe on node 30 is still pointing to node 10; that
would be further confirmation that FAILOVER_SET hasn't processed on node 30.
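
A quick way to check that, again on node 30:

  gb3=# select sub_set, sub_provider, sub_receiver, sub_active
          from sl_subscribe;

If sub_provider for set 1 still says 10, that's the confirmation.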

If that event hasn't processed, then we can at least move the confusion
from being:
 "Help!  I don't know why the configuration is so odd on node 30!"
to
 "Hmm.  The config is consistent with FAILOVER not being done yet.  What
prevented the FAILOVER_SET event from processing on node 30?"

We're not at a full answer, but the latter question points to a more
purposeful search :-).

