Wed Dec 22 07:45:48 PST 2010
On 10-12-10 04:54 PM, Christopher Browne wrote:
> Steve Singer <ssinger at ca.afilias.info> writes:
>> 2) When an event is created on node (b), if we store the fact that it has
>> already seen/confirmed event (a),1234 from node (a), we could transmit
>> this pre-condition as part of the event so node (c) can know that it
>> can't process the event from (b) until it has seen 1234 from (a). This
>> way node (c) will process things in the right order, but we can submit
>> events to (b) - which is up to date - without having to wait for the busy
>> node (c) to get caught up.
>>
>> The approach proposed here is to go with (4), where before switching
>> event nodes slonik will WAIT FOR all nodes to confirm the last event.
>
> Related to #2... We might introduce a new event that tries to
> coordinate between nodes.
>
> In effect, a "WAIT FOR EVENT" event...
>
> So, we submit, against node #1, WAIT_FOR_EVENT (2,355).
>
> The intent of this event is that processing of the stream of events
> for node #1 holds back until it has received event #355 from node #2.
>
> That doesn't mandate waiting for *EVERY* node, just one node. Multiple
> WAIT FOR EVENT requests could get you a "wait on all." Note that this
> is on the slon side, not so much the slonik side...

I will write a design proposal that explores what things might look like
if we went that route.

>> 1) STORE PATH - the event node is dictated by how you are setting up the
>> path. Furthermore, if the backwards path isn't yet set up, the node won't
>> receive the confirm message.
>
> There's an argument to be made that STORE PATH perhaps should be going
> directly to nodes, and doesn't need to be involved in event
> propagation. It's pretty cool to propagate STORE PATH requests
> everywhere, but it's not hugely necessary.
>
> ...[erm, rethinking]...
>
> The conninfo field only ever matters on the node where it is used.
> But computation of listen paths requires that all nodes have the
> [from,to] data. So there's a partial truth there. conninfo isn't
> necessary, but [from,to] is...

Exactly, we need to propagate the [from,to] information. As long as we are
doing that we might as well also store the conninfo (it doesn't cost or
hurt anything to do so).

>> 2) SUBSCRIBE SET (in 2.0.5+) always gets submitted at the origin. So if
>> you are subscribing multiple sets, slonik will switch event nodes. This
>> means that subscribing to multiple sets (with different set origins) in
>> parallel will be harder (you will need to disable automatic wait-for or
>> use different slonik invocations). You can still do parallel subscribes
>> to the same set, because the subscribe set always goes to the origin in
>> 2.0.5+, not the provider or the receiver.
>
> I have always been a little uncomfortable about this change, and this
> underlines that discomfort. But that doesn't mean I'm right...

There are three ways this could work:

1) The event can be submitted to the provider. This is how it works in
2.0.4 and at least 1.2.x. The problem with this is that the
ENABLE_SUBSCRIPTION and the SUBSCRIBE_SET events can arrive at the
receiver out of order, because the trip provider->origin->receiver might
be faster than the trip directly from provider->receiver. Doing this was
problematic.

2) The event can be submitted to the set origin, as happens in 2.0.5.
If there are problems with this approach I'd like to hear about them.

3) The event can be submitted to the receiver. Jan says this is how he
originally intended things to work in the original Slony design. I'm not
opposed to this per se, but I feel we should have a more concrete reason
for changing this than gut feel. If we are going to change this, I'd
rather we did it before the WAIT FOR stuff.
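As a rough illustration of case (2) - the 2.0.5+ behaviour - subscribing
two sets with different origins from one script might look something like
the sketch below (the cluster name, conninfos, node and set ids are made
up for the example). Because each SUBSCRIBE SET is submitted at the origin
of its own set, slonik switches event nodes between the two statements,
which is exactly where the proposed automatic WAIT FOR would kick in:

  cluster name = testcluster;
  node 1 admin conninfo = 'dbname=db1 host=host1';   # origin of set 1
  node 2 admin conninfo = 'dbname=db2 host=host2';   # origin of set 2
  node 3 admin conninfo = 'dbname=db3 host=host3';   # receiver of both

  # In 2.0.5+ each of these is submitted at the origin of its set,
  # so the event node changes between the two commands.
  subscribe set (id = 1, provider = 1, receiver = 3, forward = no);
  subscribe set (id = 2, provider = 2, receiver = 3, forward = no);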
>> 4) CREATE/DROP SET must go to the set origin. If you're creating sets,
>> the cluster probably needs to be caught up.
>
> And if these events are lost, due to a FAILOVER partition or such,
> if they were only in the partition of the cluster that was lost, it
> doesn't matter...

You mentioned your justification for this statement when we were talking
about this; for the benefit of people on the list: unlike SYNC events,
normal commands are stored in the sl_event of non-forwarder nodes. This
means that if the CREATE SET escapes the network partition to at least
one node, then it can eventually get propagated to the rest of the nodes
on that side of the divide.

>> 5) ADD TABLE/ADD SEQUENCE - must go to the origin. Again, if you're
>> manipulating sets you must stick to a single set origin or have your
>> cluster be caught up.
>> 6) MOVE TABLE goes to the origin - but the docs already warn you about
>> trying this if your cluster isn't caught up (with respect to this set).
>> 8) MOVE SET - Doing this with a behind cluster is already a bad idea.
>> 9) FAILOVER - See multi-node failover discussion.
>
> There's a mix of needful semantics here.
>
> For instance, SET ADD TABLE/SEQUENCE only forcibly need to propagate
> alongside the successful propagation of subscriptions to those sets.

I'm not convinced this is true. Consider the following case: set 1 with
origin 1, set 2 with origin 2, and then:

  set add table (set id = 1, origin = 1, id = 1,
                 fully qualified name = 'public.foo');
  set add table (set id = 2, origin = 2, id = 2,
                 fully qualified name = 'public.foo');

setAddTable_int checks for this condition (the same table being added to
two different sets), but the check will only work if the SET ADD TABLE
submitted on node 1 gets propagated to node 2, even though node 2 might
never become a subscriber to set 1.

I will point out that, based on the rules for automatic wait for that I
defined in this proposal, slonik will wait for the first add table to be
propagated to node 2 before doing the second add table (since the origin
is different). I think this is the correct behaviour.

> That's different from the propagation needs for other events. It seems
> to me that we might want to classify the "propagation needs"; if there
> are good names for the different classifications, then we're likely
> really onto something.
>
> Good names aren't arriving to me on Friday afternoon :-).
>
>> STORE PATH
>> -----------
>> A WAIT FOR ALL nodes won't work unless all of the paths are stored.
>> When I say 'all' I mean there must exist a route from every node to
>> every other node. The routes don't need to be direct. There are certain
>> common usage patterns that shouldn't be excluded. It would be good if
>> slonik could detect missing paths before 'changing things', because
>> otherwise users might be left with a half complete script.
>
> I'd classify this two ways:
>
> a) When bootstrapping a cluster, WAIT FOR ALL can't work if there aren't
> enough paths yet.
>
> I'm not sure it makes sense to go to the extent of computing spanning
> trees or such to validate this.
>
> If we try to validate at every point, then you can't have a sequence
> of...
>
> Set up all nodes...
> INIT CLUSTER
> STORE NODE
> STORE NODE
> STORE NODE
>
> Then, set up paths...
> STORE PATH
> STORE PATH
> STORE PATH
> STORE PATH
>
> It seems like a logical idea to construct a cluster by setting up all
> nodes, then to set up communications between them. It doesn't thrill me
> if we make that impossible.

Is store node an exception to the rule then? Today, what happens if you
have a script like:

  init cluster (id = 1);
  store node (id = 2, event node = 1);
  store node (id = 3, event node = 1);
  store node (id = 3, event node = 2);

The last command will fail, because when slonik tries to connect to node 3
it will determine that the Slony schema already exists on node 3. However,
the event node (2) won't actually know that node 3 has already been
created (since the event from node 1 won't have propagated, because no
paths exist yet).
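To make the "nodes first, then paths" pattern concrete, a script in that
style might look roughly like the sketch below (cluster name, node ids and
conninfos are invented for the example). No WAIT FOR ALL could succeed
until the store path block has run, since before that there is no route
from any node to any other:

  cluster name = testcluster;
  node 1 admin conninfo = 'dbname=db1 host=host1';
  node 2 admin conninfo = 'dbname=db2 host=host2';
  node 3 admin conninfo = 'dbname=db3 host=host3';

  # Set up all nodes...
  init cluster (id = 1);
  store node (id = 2, event node = 1);
  store node (id = 3, event node = 1);

  # Then, set up paths (both directions, so every node can eventually
  # reach every other; the conninfo is what the client node uses to
  # reach the server node).
  store path (server = 1, client = 2, conninfo = 'dbname=db1 host=host1');
  store path (server = 2, client = 1, conninfo = 'dbname=db2 host=host2');
  store path (server = 1, client = 3, conninfo = 'dbname=db1 host=host1');
  store path (server = 3, client = 1, conninfo = 'dbname=db3 host=host3');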
>> The easy answer is: Don't write scripts that can leave your cluster in
>> an indeterminate state. What we should do if someone tries is an open
>> question. We could a) check that all code paths (cross product) leave
>> the cluster consistent/complete, b) assume the try blocks always finish
>> successfully, or c) not do the parse tree analysis described above for
>> the entire script at parse time, but instead do it for each block before
>> entering that block.
>> I am leaning towards c.
>
> If we're going down a "prevent nondetermined states" road, then it seems
> to me there needs to be a presentation of a would-be algebra of cluster
> states so we can talk about this analytically.
>
> I think having that algebra is a prerequisite to deciding between any of
> those alternatives.

That sounds like a good idea. When I write that alternate design on using
slon side waits I will see if I can develop something.

>> - How do we want to handle TRY blocks? See discussion above.
>
> WAIT FOR and TRY are right well incompatible with each other, unless we
> determine, within the algebra, that there is some subset of commands
> that make state changes that we consider don't need to be guarded by
> WAIT FOR and that are permissible in a TRY block.
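For reference, a TRY block in slonik groups a handful of commands together
with error handling, roughly like the sketch below (the set and table ids
and the echo text are invented for the example). The open question above
is which, if any, of the commands inside such a block could safely have an
implicit WAIT FOR attached to them:

  try {
      create set (id = 1, origin = 1, comment = 'first set');
      set add table (set id = 1, origin = 1, id = 1,
                     fully qualified name = 'public.foo');
  }
  on error {
      echo 'could not create and populate set 1';
      exit 1;
  }
  on success {
      echo 'set 1 created';
  }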