[Slony1-general] Failover Stalls

Tue Sep 27 14:54:18 PDT 2005

I have set up Slony-I 1.1.0 on two servers, each with identical
databases, and have organised replication between the two of them.

I can get a switchover to work, but when I do a failover the failover
script stops in the middle of the failover command.

Here is an extract from the script. The variables are set by the rest
of the script: nodeId is the ID of the subscriber and remoteId is the
ID of the origin. The idea is to run the script on the surviving server
after something nasty has happened to the other server.

log "Attempting failover to local node ($nodeId)"
log `date`
slonik <<EOF
    cluster name = $CLUSTER_NAME;
    node 1 admin conninfo = '$one_conninfo';
    node 2 admin conninfo = '$two_conninfo';
    echo 'Failing over to node $nodeId';
    failover ( id=$remoteId, backup node = $nodeId);
    echo 'Failover complete';
EOF
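
A minimal sketch of how the slonik input above could be printed on its
own, so the expanded commands can be checked before they are piped to
slonik. build_failover_script is a hypothetical helper name and is not
part of the original script; the argument order is an assumption made
for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the slonik commands for a failover so they
# can be inspected before being piped to slonik.
# Arguments: cluster name, node 1 conninfo, node 2 conninfo,
#            failed node id, backup node id.
build_failover_script() {
    cat <<EOF
cluster name = $1;
node 1 admin conninfo = '$2';
node 2 admin conninfo = '$3';
echo 'Failing over to node $5';
failover ( id = $4, backup node = $5 );
echo 'Failover complete';
EOF
}

# Possible usage (not run here):
#   build_failover_script "$CLUSTER_NAME" "$one_conninfo" "$two_conninfo" \
#       "$remoteId" "$nodeId" | slonik
```

Printing the text first makes it easy to confirm that the failed node
and the backup node have not been swapped.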

In my test scenario, node 2 is the origin. I kill the postmaster on node
2 to simulate the server dying a horrible death. The slon daemon on node
2 dies and the slon daemon on node 1 starts to complain of being unable
to access the node 2 database (I've x'd out the true IP address) :-

2005-09-27 14:33:39 BST DEBUG2 remoteWorkerThread_2: forward confirm 1,9841 received by 2
2005-09-27 14:33:40 BST ERROR  remoteListenThread_2: "select ev_origin, ev_seqno, ev_timestamp,        ev_minxid, ev_maxxid, ev_xip, ev_type,        ev_data1, ev_data2,        ev_data3, ev_data4, ev_data5, ev_data6,        ev_data7, ev_data8 from "_dot_ha".sl_event e where (e.ev_origin = '2' and e.ev_seqno > '26') order by e.ev_origin, e.ev_seqno" - server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
2005-09-27 14:33:49 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9842
2005-09-27 14:33:49 BST DEBUG2 localListenThread: Received event 1,9842 SYNC
2005-09-27 14:33:50 BST ERROR  slon_connectdb: PQconnectdb("dbname=DOT host=xxx.xxx.xxx.xxx user=postgres") failed - could not connect to server: Connection refused
	Is the server running on host "xxx.xxx.xxx.xxx" and accepting
	TCP/IP connections on port 5432?
2005-09-27 14:33:50 BST WARN   remoteListenThread_2: DB connection failed - sleep 10 seconds
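
A small sketch for pulling only the ERROR and WARN entries out of a slon
log like the one above. It assumes the layout shown here (date, time,
timezone, then the level as the fourth field); slon_problems is a
hypothetical name chosen for this example.

```shell
#!/bin/sh
# Hypothetical helper: keep only ERROR and WARN entries from a slon log
# read on stdin. Assumes the level is the fourth whitespace-separated
# field, as in the excerpt above. Continuation lines that do not start
# with a timestamp are dropped.
slon_problems() {
    awk '$4 == "ERROR" || $4 == "WARN"'
}

# Possible usage (the log file name is an assumption):
#   slon_problems < slon_node1.log
```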

At this point I have no origin, but a working subscriber that is up to
date as of the last SYNC.

I want to make this subscriber the new origin. A switchover won't work
so I execute the above script on node 1 and I get the following output:-

ha-failover.sh: Attempting failover to local node (1)
ha-failover.sh: Tue Sep 27 14:34:17 BST 2005
<stdin>:4: Failing over to node 1
<stdin>:5: NOTICE:  failedNode: set 1 has no other direct receivers - move now

And then the script just hangs there (I've left it running for over an
hour). It seems to be stuck on the failover line as it never reaches the
second echo statement.
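
While investigating a hang like this on a test rig, the slonik call
could be wrapped so that the script gives up after a fixed time instead
of blocking indefinitely. This is only a sketch: it assumes the GNU
coreutils timeout utility is installed, and run_with_limit is a
hypothetical name. Killing slonik part-way does not undo whatever the
failover command has already changed, so this is a diagnostic aid and
not a fix.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command but give up after a number of
# seconds. Assumes the GNU coreutils `timeout` utility is available;
# it exits with status 124 when the time limit is reached.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Possible usage (failover.slonik is an assumed file holding the
# slonik commands):
#   run_with_limit 300 slonik < failover.slonik \
#       || echo "slonik did not finish cleanly (exit $?)"
```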

When I look at the node 1 database I see that I can now update the
replicated tables, so node 1 now thinks it is the master. I can check
this by inspecting sl_set and see the origin for my replication set is
now node 1. The sl_subscribe table is empty. The sl_node table shows
both nodes and both are active, which strikes me as suspicious.

DOT=# select * from _dot_ha.sl_node;
 no_id | no_active |   no_comment    | no_spool
-------+-----------+-----------------+----------
     1 | t         | Node One        | f
     2 | t         | Node Two        | f
(2 rows)
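
The three checks described above can be sketched as a single helper
that prints the queries for psql. The column names for sl_set and
sl_subscribe are taken from the Slony-I schema documentation and should
be checked against the installed version; failover_state_sql is a
hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the queries used to inspect the Slony-I
# state after a failover attempt. The argument is the cluster schema,
# for example _dot_ha.
failover_state_sql() {
    schema=$1
    cat <<EOF
select set_id, set_origin from $schema.sl_set;
select sub_set, sub_provider, sub_receiver, sub_active from $schema.sl_subscribe;
select no_id, no_active from $schema.sl_node;
EOF
}

# Possible usage (not run here):
#   failover_state_sql _dot_ha | psql -U postgres DOT
```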

There is only one line in the slon daemon log that is of significance at
the moment of the failover:-

2005-09-27 14:34:10 BST WARN   remoteListenThread_2: DB connection failed - sleep 10 seconds
2005-09-27 14:34:17 BST INFO   localListenThread: got restart notification - signal scheduler
2005-09-27 14:34:20 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9845

Can anybody give me any clues as to what is going on?

Thanks

Steve Hindmarch
BT Exact