Darcy Buskermolen
Tue Sep 6 22:28:34 PDT 2005
On Monday 22 August 2005 17:12, elein wrote:
> Slony 1.1.  Three nodes. 10 set(1) => 20 => 30.
>
> I ran failover from node10 to node20.
>
> On node30, the origin of the set was changed
> from 10 to 20, however, drop node10 failed
> because of the row in sl_setsync.
>
> This causes slon on node30 to quit and the cluster to
> become unstable, which in turn prevents putting
> node10 back into the mix.
>
> Please tell me I'm not the first one to run into
> this...
>
> The only clean workaround I can see is to drop
> node 30, re-add it, and then re-add node10.  This
> leaves us without a backup during the downtime.
>
>
> This is what is in some of the tables for node20:
>
> gb2=# select * from sl_node;
>  no_id | no_active |       no_comment        | no_spool
> -------+-----------+-------------------------+----------
>     20 | t         | Node 20 - gb2@localhost | f
>     30 | t         | Node 30 - gb3@localhost | f
> (2 rows)
>
> gb2=# select * from sl_set;
>  set_id | set_origin | set_locked |     set_comment
> --------+------------+------------+----------------------
>       1 |         20 |            | Set 1 for gb_cluster
> (1 row)
>
> gb2=# select * from sl_setsync;
>  ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
> -----------+------------+-----------+------------+------------+---------+-----------------
> (0 rows)
>
> This is what I have for node30:
>
> gb3=# select * from sl_node;
>  no_id | no_active |       no_comment        | no_spool
> -------+-----------+-------------------------+----------
>     10 | t         | Node 10 - gb@localhost  | f
>     20 | t         | Node 20 - gb2@localhost | f
>     30 | t         | Node 30 - gb3@localhost | f
> (3 rows)
>
> gb3=# select * from sl_set;
>  set_id | set_origin | set_locked |     set_comment
> --------+------------+------------+----------------------
>       1 |         20 |            | Set 1 for gb_cluster
> (1 row)
>
> gb3=# select * from sl_setsync;
>  ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
> -----------+------------+-----------+------------+------------+---------+-----------------
>          1 |         10 |       235 | 1290260    | 1290261    |         |
> (1 row)
>
> frustrated,
> --elein
Elein,
I can share your frustration. I have just started investigating failover for
the first time, and I have yet to see a clean failover complete: no matter how
I run it, I end up with nodes that are no longer in sync with the other nodes.
My time is fairly short this week, but I hope to spend some of it on this;
I've pushed all my other Slony work to the back burner until this has a solid
resolution.
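
For reference, here is a minimal slonik sketch of the sequence under
discussion (the cluster name is taken from the set comment; the conninfo
strings are guesses based on the node comments, so adjust them for your
environment):

  cluster name = gb_cluster;
  node 10 admin conninfo = 'dbname=gb host=localhost';
  node 20 admin conninfo = 'dbname=gb2 host=localhost';
  node 30 admin conninfo = 'dbname=gb3 host=localhost';

  # promote node 20 to origin in place of the failed node 10
  failover (id = 10, backup node = 20);

  # then remove the failed node from the cluster configuration;
  # this is the step that fails on node 30 in elein's report
  drop node (id = 10, event node = 20);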

Jan/Chris, are either of you able to reproduce a stable failover on a
multi-node cluster (more than a single origin/subscriber pair)?
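
Elein, one manual workaround that may be worth trying (untested, so treat it
as a sketch; it assumes the Slony schema on node 30 is named _gb_cluster) is
to delete the stale sl_setsync row that still points at the old origin, then
re-issue the drop:

  -- on node 30 (gb3): the leftover row has ssy_origin = 10
  delete from _gb_cluster.sl_setsync where ssy_origin = 10;

I'd want Jan or Chris to confirm that is safe before relying on it, though.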


-- 
Darcy Buskermolen
Wavefire Technologies Corp.

http://www.wavefire.com
ph: 250.717.0200
fx: 250.763.1759

