Brian Fehrle brianf at consistentstate.com
Mon May 10 15:07:25 PDT 2010
Hi all,
    I've been running into a problem with dropping a node from the 
slony cluster: the slony system catalogs aren't getting fully cleaned 
up when the node is dropped.

    I have a three node cluster, one master and two slaves. I have a 
script that generates the slonik command to drop one of the slaves (in 
this case node 3) from the slony cluster, and it executes without 
problem. However, after performing the drop node a few dozen times, 
there have been several instances in which the data in 
_slony.sl_status still refers to the third node, and its 
st_lag_num_events value climbs and climbs (since there is no node to 
sync with, it will never drop to 0).
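
A query along these lines is the kind of check I mean (just a sketch, 
with node 3 hard-coded as the slave being dropped):

select st_origin, st_received, st_lag_num_events, st_lag_time
  from _slony.sl_status
 where st_origin = 3 or st_received = 3;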

So the problem is that after I drop a node, everything looks great 
except that the _slony.sl_status table, on any or all of the remaining 
nodes, still refers to the node that was just dropped.

I did quite a few test runs of the drop node to try to reproduce the 
problem and determine its cause. After the drop node, if I look in 
sl_node, sl_path, sl_event, or any of the other sl_ tables, I see no 
reference to the third node. However, about half the time I would 
still get references to the third node in sl_status. This can be on 
the master node, on the (remaining) slave node, or on both. In one 
test scenario I monitored the sl_status table and saw node 3 
disappear, then reappear a second later, and then remain.

Example queries done on node 2 (slave) after dropping node 3 (other slave):

postgres=# select * from _slony.sl_node;
 no_id | no_active | no_comment | no_spool
-------+-----------+------------+----------
     1 | t         | Server 1   | f
     2 | t         | Server 2   | f
(2 rows)

postgres=# select * from _slony.sl_path ;
 pa_server | pa_client |                         pa_conninfo                         | pa_connretry
-----------+-----------+--------------------------------------------------------------+--------------
         1 |         2 | dbname=postgres host=172.16.44.111 port=5432 user=postgres  |           10
         2 |         1 | dbname=postgres host=172.16.44.129 port=5432 user=postgres  |           10
(2 rows)

postgres=# select * from _slony.sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |   st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
         2 |           1 |          1649 | 2010-05-10 15:53:16.245529 |             1649 | 2010-05-10 15:53:16.246212 | 2010-05-10 15:53:16.245529 |                 0 | 00:00:05.57205
         2 |           3 |          1656 | 2010-05-10 15:54:26.280131 |             1636 | 2010-05-10 15:51:05.341512 | 2010-05-10 15:51:05.343754 |                20 | 00:03:22.66664
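
If I'm reading the 1.2 catalogs right, sl_status is a view built on 
sl_event and sl_confirm, so checking those underlying tables directly 
with something like the sketch below (node 3 being the dropped node) 
should show whether anything for node 3 is actually left there:

select ev_origin, max(ev_seqno) as last_event
  from _slony.sl_event
 where ev_origin = 3
 group by ev_origin;

select con_origin, con_received, max(con_seqno) as last_confirmed
  from _slony.sl_confirm
 where con_origin = 3 or con_received = 3
 group by con_origin, con_received;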


Also, another problem that may be linked is that the slon daemon for 
node 3 does not terminate itself after the drop. Watching the log 
output from that daemon, it shows that it receives the drop node 
command for itself and drops the _slony schema as intended. However, 
after that it reports "2010-05-10 15:57:56 MDT FATAL  main: Node is 
not initialized properly - sleep 10s" and keeps retrying every ten 
seconds. I'm not sure if this daemon is somehow making post-drop-node 
entries in sl_event that cause the sl_status entry to be recreated.
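
If that were happening I would expect fresh rows mentioning node 3 to 
keep appearing after the drop; something like this (again just a 
sketch, not verified) is how I plan to watch for it:

select con_origin, con_received, con_seqno, con_timestamp
  from _slony.sl_confirm
 where con_origin = 3 or con_received = 3
 order by con_timestamp desc
 limit 10;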

In case it helps, here is a copy of the drop node script I'm running.

#!/bin/bash
slonik <<_EOF_
cluster name = slony;
node 1 admin conninfo = ' dbname=postgres host=172.16.44.111 port=5432 user=postgres';
node 2 admin conninfo = ' dbname=postgres host=172.16.44.129 port=5432 user=postgres';
node 3 admin conninfo = ' dbname=postgres host=172.16.44.142 port=5432 user=postgres';
DROP NODE ( ID = 3, EVENT NODE = 1 );
_EOF_

I am running CentOS 5, postgres 8.4.2, and slony 1.2.20 on all three 
nodes.

Thanks in advance,
    Brian Fehrle

