[Slony1-hackers] WAIT FOR: DROP NODE or STORE NODE

Wed Mar 2 11:12:13 PST 2011

On 11-03-02 09:25 AM, Steve Singer wrote:
> Consider a three node cluster with nodes 1,2 and 3.
>
> I then issue the command.
>
> store node(id=4, event node=1);
> subscribe set(id=1,provider=2,receiver=4);
>
> Slonik needs to wait for the store node to propagate to node 2 before
> doing the subscribe set. The problem is that node 2 does not yet have an
> sl_node entry for node 4.
>
> Slonik can't tell the difference between this case and
>
> store node(id=4,event node=1);
> #wait
> store node(id=5,event node=4);
> drop node(id=4,event node=2);
> subscribe set(id=1,provider=2,receiver=4);
>

>
> 3) We could have slonik check sl_node on all other nodes to see if the
> node is active anywhere else. If it is we wait, if at some point the
> node is deleted from sl_node on all nodes then slonik can assume that it
> is a drop node. I don't like this because it would involve slonik
> constantly polling all nodes in a cluster (during a wait) if it can't
> find an sl_node entry on the node it is waiting on but if it only did
> this when a node entry was missing maybe it isn't so bad.
>
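
To make approach 3 concrete, here is a rough sketch of the decision slonik
would have to make on each poll. It is only an illustration: the function
name, the enum and the dict-of-sets input are made up for this example, and
a real implementation would read sl_node from each node over its own
connection rather than take a prebuilt dict.

```python
from enum import Enum


class WaitDecision(Enum):
    PROCEED = "proceed"                # the node we wait on already knows the target
    KEEP_WAITING = "keep_waiting"      # looks like a store node still propagating
    ASSUME_DROPPED = "assume_dropped"  # nobody knows the target: treat as drop node


def decide_wait(target_node, wait_on_node, sl_node_by_node):
    """Decide what to do during one poll of the wait loop.

    sl_node_by_node is a hypothetical snapshot: it maps each reachable
    node id to the set of node ids currently present in that node's
    sl_node table.
    """
    # Normal case: the entry has arrived, no need to look anywhere else.
    if target_node in sl_node_by_node.get(wait_on_node, set()):
        return WaitDecision.PROCEED

    # Entry is missing, so (and only then) check every other node.
    seen_elsewhere = any(
        target_node in ids
        for node, ids in sl_node_by_node.items()
        if node != wait_on_node
    )
    return WaitDecision.KEEP_WAITING if seen_elsewhere else WaitDecision.ASSUME_DROPPED
```

With the first example above (store node on node 1, waiting on node 2),
node 1 already lists node 4, so the sketch keeps waiting. Once every node
has removed node 4, it falls through to the drop node assumption.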

The other thing to think about is that there might be non-sync events
that originated on the dropped node and made it to at least one other
node but not everywhere. How do we know about those events? They need
to be processed everywhere, in the correct order, before the drop of
that node gets processed.
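
As a rough illustration of the bookkeeping this implies, the sketch below
takes, for each surviving node, the highest event sequence number it has
received from the dropped node, and reports which node is furthest ahead
and how far the others lag. The function name and the input dict are
assumptions for this example. In practice the numbers would have to be
gathered from sl_event and sl_confirm on each node.

```python
def lagging_nodes(last_event_seen):
    """Find which survivors are missing events from the dropped node.

    last_event_seen is a hypothetical snapshot: surviving node id ->
    highest event sequence number from the dropped node that the
    survivor has received.

    Returns (most_ahead, behind), where most_ahead is the lowest-numbered
    node holding the newest event and behind maps each lagging node to
    the number of sequence values it is missing.
    """
    if not last_event_seen:
        raise ValueError("need at least one surviving node")

    newest = max(last_event_seen.values())

    # Several nodes may hold the newest event; pick one deterministically.
    most_ahead = min(n for n, seq in last_event_seen.items() if seq == newest)

    # Everyone below the newest sequence number still has events to receive.
    behind = {
        n: newest - seq
        for n, seq in last_event_seen.items()
        if seq < newest
    }
    return most_ahead, behind
```

A non-empty `behind` result means the drop cannot safely be processed yet
on those nodes, because events from the dropped node are still outstanding.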

If a non-origin node 'fails' we might still have to do a subset of the
proposed failover logic to make sure non-sync events get where they need
to go.

How much of this problem can we solve by borrowing from the new failover 
code?

> Are other issues going to force us to implement approach 2 for 2.1 (i.e.
> something in multi-node failover)?
>
> Also 3 is pretty incompatible with the slon side failover but 2 is a
I meant slon side 'wait for' in the above line.

> chunk of work. (Actually, if we make the existing no_id columns in sl_node,
> sl_event, sl_confirm and all the stored procedures refer to a uuid, and
> then make the slonik commands map the id=blah to the new column we add
> to sl_node, then this might not be so bad, but it is a bit confusing in
> that a node id in a slonik command refers to a different column than
> sl_node.no_id.)
>
> Thoughts?
>