Mon May 2 21:10:58 PDT 2005
Julian Scarfe wrote:

> For cluster T1 configured as simple master/slave I've been using:
>
>     select "_T1".getlocalnodeid('_T1') = (select set_origin from
>     "_T1".sl_set where set_id = 1)
>
> to give me a boolean result for whether I'm on the master node for
> set 1.  If there's a better way I'd like to know, but it seems to
> work fine.
>
> However, after a failover, there's a problem.  Both nodes think
> they're the master node according to the criteria above, because the
> old master hasn't received the news of his demotion.  Is there a
> simple way of working out on the "abandoned" node that it has indeed
> been abandoned?
>
> In a sense, I realise it should never come to that: if I can connect
> to the database on the old master, I should have used "move set" not
> "failover", and if I can't connect, I might as well pull the plug on
> it.  But there are situations in which the old master node might
> become available again before I get a chance to reconfigure it, so
> I'd like the client to be able to tell that it has been abandoned
> and that it should use a different node.

There isn't an easy answer to this one, because it is difficult to
predict where things will stand at the point at which the node is
treated as "destroyed."  Consider some scenarios:

1. Node #1 is in Ottawa; the other nodes are in Toronto.  We fail
   over due to persistent network failure: the network falls over,
   and we decide that the Ottawa data source must be abandoned.  The
   database host and its database are undamaged, and since we had no
   way to communicate with node #1 that it got stepped on, it thinks
   it's running fine.

   Note that in this case, no data was ever corrupted in any way.
   Note also that since the network was dead, we had no way to tell
   node #1 that it has been abandoned.  Supposing there are some
   client machines in Ottawa, they might be able to talk to node #1
   even after it is abandoned, as they are on the subnet there.
   Could be trouble...

2. Three nodes at one site.  Node #1 is on a machine that physically
   catches fire.  Disk drives are turned into a "more oxidized" set
   of aluminium oxide.  There's no more database there; node #1 will
   never run again.  There's nothing to be done on the Slony-I side
   to former node #1.  Your hazmat team might have a job to do...

3. Node #1 gets a speck of dust in a disk drive casing, destroying
   some data.  In this case, we fail over and want to leave things as
   untouched as possible.  There is the chance that some analysis
   might be able to extract the last 8 seconds' worth of updates that
   hadn't gotten replicated elsewhere.

You're pretty clearly talking about scenario #1, where, to a client
that isn't talking to the rest of the network, there may be no reason
to think that node #1 has been abandoned.

I would be reluctant to try to "automatically" ship a message to the
failing node because of scenario #3.  If we have some failing disk
hardware, I want to leave it well enough alone.  There is the
(however remote) chance that there may be data to be recovered, and
every modification made once the hardware is damaged increases the
chances that further data will be lost that we cannot recover.

The particular problem that we see in scenario #1 is that some client
might not realize that node #1 is supposed to be dead.  Supposing
connectivity between Toronto and Ottawa died at 1pm and was not
reestablished until 9pm, then during those 8 hours, node #1 is "live"
as far as it is concerned, and there is no way to tell it that it
shouldn't think so.
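For what it's worth, a client on a possibly-abandoned node could at
least flag the situation heuristically by combining your origin check
with a staleness test against sl_event: if the node believes it is
the origin of set 1 but hasn't received an event from any other node
recently, its "master" status is suspect.  A minimal sketch (the
10-minute threshold is an arbitrary assumption, and this can only
raise suspicion, not prove abandonment):

    -- Heuristic only: "I think I'm the origin of set 1, but have I
    -- heard from the rest of the cluster lately?"
    select "_T1".getlocalnodeid('_T1') = set_origin as thinks_master,
           exists (select 1
                     from "_T1".sl_event
                    where ev_origin <> "_T1".getlocalnodeid('_T1')
                      and ev_timestamp > now() - interval '10 minutes')
               as heard_from_peers
      from "_T1".sl_set
     where set_id = 1;

thinks_master = true together with heard_from_peers = false doesn't
prove the node was abandoned (the network might merely be slow), but
it is a reasonable trigger for refusing writes until an administrator
has had a look.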
At 9pm, we get some options, but it is by no means automatic...
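The usual shape of those options, once you can reach Ottawa again:
have the surviving cluster forget node 1 entirely, and strip the old
Slony-I configuration off the machine before (optionally)
re-subscribing it as a brand new node.  A slonik sketch, assuming
node 2 is the new origin; the conninfo strings are placeholders:

    cluster name = T1;
    node 1 admin conninfo = 'dbname=mydb host=ottawa-host user=slony';
    node 2 admin conninfo = 'dbname=mydb host=toronto-host user=slony';

    # Tell the surviving nodes to forget node 1 entirely.
    drop node (id = 1, event node = 2);

    # On the old master itself, remove the Slony-I schema and its
    # triggers, so local clients no longer see it claiming to be an
    # origin.
    uninstall node (id = 1);

And none of that should be fired automatically, for exactly the
scenario #3 reason above: a human should look at the old node first.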