Wed Sep 28 20:20:12 PDT 2005
- Previous message: [Slony1-general] Failover Stalls
- Next message: [Slony1-general] is Slony working on windows?
stephen.hindmarch at bt.com wrote:

> I have set up Slony-I 1.1.0 on two servers, each with identical
> databases, and have organised replication between the two of them.
>
> I can get a switchover to work, but when I do a failover, the failover
> script stops in the middle of the failover command.
>
> Here is an extract from the script. The variables are set by the rest
> of the script: nodeId is the name of the subscriber, remoteId is the
> name of the origin, and the idea is to run the script on the surviving
> server after something nasty has happened to the other server.
>
> log "Attempting failover to local node ($nodeId)"
> log `date`
> slonik <<EOF
>     cluster name = $CLUSTER_NAME;
>     node 1 admin conninfo = '$one_conninfo';
>     node 2 admin conninfo = '$two_conninfo';
>     echo 'Failing over to node $nodeId';
>     failover (id = $remoteId, backup node = $nodeId);
>     echo 'Failover complete';
> EOF
>
> In my test scenario, node 2 is the origin. I kill the postmaster on
> node 2 to simulate the server dying a horrible death. The slon daemon
> on node 2 dies, and the slon daemon on node 1 starts to complain of
> being unable to access the node 2 database (I've x'd out the true IP
> address):
>
> 2005-09-27 14:33:39 BST DEBUG2 remoteWorkerThread_2: forward confirm
> 1,9841 received by 2
> 2005-09-27 14:33:40 BST ERROR remoteListenThread_2: "select ev_origin,
> ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip,
> ev_type, ev_data1, ev_data2, ev_data3, ev_data4,
> ev_data5, ev_data6, ev_data7, ev_data8 from "_dot_ha".sl_event e
> where (e.ev_origin = '2' and e.ev_seqno > '26') order by e.ev_origin,
> e.ev_seqno" - server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> 2005-09-27 14:33:49 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9842
> 2005-09-27 14:33:49 BST DEBUG2 localListenThread: Received event 1,9842 SYNC
> 2005-09-27 14:33:50 BST ERROR slon_connectdb: PQconnectdb("dbname=DOT
> host=xxx.xxx.xxx.xxx user=postgres") failed - could not connect to
> server: Connection refused
>     Is the server running on host "xxx.xxx.xxx.xxx" and accepting
>     TCP/IP connections on port 5432?
> 2005-09-27 14:33:50 BST WARN remoteListenThread_2: DB connection
> failed - sleep 10 seconds
>
> At this point I have no origin, but a working subscriber that is up to
> date, at least as of the last sync.
>
> I want to make this subscriber the new origin. A switchover won't
> work, so I execute the above script on node 1 and get the following
> output:
>
> ha-failover.sh: Attempting failover to local node (1)
> ha-failover.sh: Tue Sep 27 14:34:17 BST 2005
> <stdin>:4: Failing over to node 1
> <stdin>:5: NOTICE: failedNode: set 1 has no other direct receivers -
> move now
>
> And then the script just hangs there (I've left it running for over an
> hour). It seems to be stuck on the failover line, as it never reaches
> the second echo statement.

That is somewhat curious. I'll see if I can figure out why that would
be. "Wild speculation" (which is no more valuable than "speculative
gossip") would be that perhaps it's waiting to tell all the remaining
subscribers something, and since there aren't any, something gets
confused about that.

Your scenario here is one where it would be about as useful to simply do
an UNINSTALL NODE on node 1, because once the FAILOVER is done, there
will be nothing other than node 1 in the cluster. With no subscribers,
the presence of replication is pretty well a "historical curiosity."
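If you did want to tear replication down at that point, a minimal sketch
would be something like the following. (This is untested against your
setup; the conninfo string is a placeholder, and the cluster name dot_ha
is inferred from the "_dot_ha" schema in your logs.)

#!/bin/sh
# Sketch only: remove all Slony-I objects from the surviving node,
# leaving the application data in place.
slonik <<EOF
cluster name = dot_ha;
node 1 admin conninfo = 'dbname=DOT host=node1.example.com user=postgres';
uninstall node (id = 1);
EOF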
Under such a circumstance, with two nodes and the master dead, I'd be
inclined to simply drop replication, as, with only one node, you don't
honestly have replication going on anymore...

> When I look at the node 1 database, I see that I can now update the
> replicated tables, so node 1 now thinks it is the master. I can check
> this by inspecting sl_set, and the origin for my replication set is
> now node 1. The sl_subscribe table is empty. The sl_node table shows
> both nodes, and both are active, which strikes me as suspicious.
>
> DOT=# select * from _dot_ha.sl_node;
>  no_id | no_active |   no_comment    | no_spool
> -------+-----------+-----------------+----------
>      1 | t         | Node One        | f
>      2 | t         | Node Two        | f
> (2 rows)

This actually *isn't* suspicious; it is normal. FAILOVER doesn't drop
the failed node. Dropping a node has, alas, side effects, notably
purging out information about the events coming from that node. That
would be a real problem supposing we had a node 3 that was more up to
date than node 1: we would know that node 3 had some better data, but
have no way to properly apply it to node 1 to bring it up to speed. That
would essentially add insult to injury; node 3 was in better shape, but
we would have to drop it, too, because there's no way to get at its data
:-(.

Anyhoo, node 2 won't go away until you explicitly drop it, which should
wait until the reformed cluster is working OK...

As for sl_subscribe, there is no longer any subscriber to set 1. Node 1
is the only node still working, and nothing is subscribing to it. The
emptiness of sl_subscribe is just fine.

> There is only one line in the slon daemon log that is of significance
> at the moment of the failover:
>
> 2005-09-27 14:34:10 BST WARN remoteListenThread_2: DB connection
> failed - sleep 10 seconds
> 2005-09-27 14:34:17 BST INFO localListenThread: got restart
> notification - signal scheduler
> 2005-09-27 14:34:20 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC
> 9845
>
> Can anybody give me any clues as to what is going on?

It seems to me as though everything is actually OK. You'll want to drop
node 2...
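Once the reformed cluster looks healthy, dropping node 2 might look
something like this sketch (again, the conninfo is a placeholder; the
event node must be a surviving node, hence node 1):

#!/bin/sh
# Sketch only: make the cluster forget the failed node 2.
# Only node 1's conninfo is needed, since node 2 is dead.
slonik <<EOF
cluster name = dot_ha;
node 1 admin conninfo = 'dbname=DOT host=node1.example.com user=postgres';
drop node (id = 2, event node = 1);
EOF

Afterward, "select * from _dot_ha.sl_node;" should show only node 1.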