Mon May 2 21:10:58 PDT 2005
Julian Scarfe wrote:

> For cluster T1 configured as simple master/slave I've been using:
>
>     select "_T1".getlocalnodeid('_T1') = (select set_origin from
>     "_T1".sl_set where set_id = 1)
>
> to give me a boolean result for whether I'm on the master node for
> set 1.  If there's a better way I'd like to know, but it seems to
> work fine.
>
> However, after a failover, there's a problem.  Both nodes think
> they're the master node according to the criteria above, because the
> old master hasn't received the news of his demotion.  Is there a
> simple way of working out on the "abandoned" node that it has indeed
> been abandoned?
>
> In a sense, I realise it should never come to that: if I can connect
> to the database on the old master, I should have used "move set" not
> "failover", and if I can't connect, I might as well pull the plug on
> it.  But there are situations in which the old master node might
> become available again before I get a chance to reconfigure it, so
> I'd like the client to be able to tell that it has been abandoned
> and that it should use a different node.

There isn't an easy answer to this one, because it is difficult to
predict where things will stand at the point at which the node is
treated as "destroyed."  Consider some scenarios:

1. Node #1 is in Ottawa; the other nodes are in Toronto.  We fail
   over due to persistent network failure: the network falls over,
   and we decide that the Ottawa data source must be abandoned.  The
   database host and its database are undamaged, and since we had no
   way to communicate with node #1 that it got stepped on, it thinks
   it's running fine.

   Note that in this case, no data was ever corrupted in any way.
   Note also that since the network was dead, we had no way to tell
   node #1 that it has been abandoned.  Supposing there are some
   client machines in Ottawa, they might be able to talk to node #1
   even after it is abandoned, as they are on the subnet there.
   Could be trouble...

2. Three nodes at one site.  Node #1 is on a machine that physically
   catches fire.  Disk drives are turned into a "more oxidized" set
   of aluminium oxide.  There's no more database there; node #1 will
   never run again.  There's nothing to be done on the Slony-I side
   to former node #1.  Your hazmat team might have a job to do...

3. Node #1 gets a speck of dust in a disk drive casing, destroying
   some data.  In this case, we fail over and want to leave things as
   untouched as possible.  There is the chance that some analysis
   might be able to extract the last 8 seconds' worth of updates that
   hadn't gotten replicated elsewhere.

You're pretty clearly talking about scenario #1, where, to a client
that isn't talking to the rest of the network, there may be no reason
to think that node #1 has been abandoned.

I would be reluctant to try to "automatically" ship a message to the
failing node because of scenario #3.  If we have some failing disk
hardware, I want to leave it well enough alone.  There is the
(however remote) chance that there may be data to be recovered, and
every modification made once the hardware is damaged increases the
chances that further data will be lost that we cannot recover.

The particular problem that we see in scenario #1 is that some client
might not realize that node #1 is supposed to be dead.  Supposing
connectivity between Toronto and Ottawa died at 1pm and was not
reestablished until 9pm, then during those 8 hours, node #1 is "live"
as far as it is concerned, and there is no way to tell it that it
shouldn't think so.
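For what it's worth, a client on a possibly-abandoned node could at
least flag the situation heuristically by combining your origin check
with a staleness test against sl_event: if the node believes it is
the origin of set 1 but hasn't received an event from any other node
recently, its "master" status is suspect.  A minimal sketch (the
10-minute threshold is an arbitrary assumption, and this can only
raise suspicion, not prove abandonment):

    -- Heuristic only: "I think I'm the origin of set 1, but have I
    -- heard from the rest of the cluster lately?"
    select "_T1".getlocalnodeid('_T1') = set_origin as thinks_master,
           exists (select 1
                     from "_T1".sl_event
                    where ev_origin <> "_T1".getlocalnodeid('_T1')
                      and ev_timestamp > now() - interval '10 minutes')
               as heard_from_peers
      from "_T1".sl_set
     where set_id = 1;

thinks_master = true together with heard_from_peers = false doesn't
prove the node was abandoned (the network might merely be slow), but
it is a reasonable trigger for refusing writes until an administrator
has had a look.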
At 9pm, we get some options, but it is by no means automatic...
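The usual shape of those options, once you can reach Ottawa again:
have the surviving cluster forget node 1 entirely, and strip the old
Slony-I configuration off the machine before (optionally)
re-subscribing it as a brand new node.  A slonik sketch, assuming
node 2 is the new origin; the conninfo strings are placeholders:

    cluster name = T1;
    node 1 admin conninfo = 'dbname=mydb host=ottawa-host user=slony';
    node 2 admin conninfo = 'dbname=mydb host=toronto-host user=slony';

    # Tell the surviving nodes to forget node 1 entirely.
    drop node (id = 1, event node = 2);

    # On the old master itself, remove the Slony-I schema and its
    # triggers, so local clients no longer see it claiming to be an
    # origin.
    uninstall node (id = 1);

And none of that should be fired automatically, for exactly the
scenario #3 reason above: a human should look at the old node first.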