Steve Singer ssinger at ca.afilias.info
Fri Dec 10 06:19:09 PST 2010
This initial proposal describes an approach to handling automatic 
WAIT FOR in Slony 2.1.

Overview
--------------
In current (<=2.0) versions of Slony, some events are submitted to an 
event node by slonik.  These events (which exist as rows in sl_event) get 
propagated to other nodes via replication (slon).  When one of these events 
is processed on another node, that node runs the code (usually a 
stored procedure) associated with the event.  Slonik considers itself 
finished with the command as soon as the event is created and the stored 
procedure that slonik called has completed.

Just because slonik has moved on to the next line of the script doesn't 
mean that the previous command has been processed everywhere.  To 
address this, slonik has the 'WAIT FOR' command, which tells slonik to 
wait until the last event submitted to a particular node has been 
confirmed by other nodes.

Properly written slonik scripts 'WAIT FOR' events to be 
processed on other nodes before submitting further commands.
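
For example, a script that creates a new node and then immediately 
subscribes it will typically include an explicit wait.  A minimal sketch 
(the cluster name and conninfo strings are placeholders):

cluster name = testcluster;
node 1 admin conninfo = 'dbname=db1';
node 2 admin conninfo = 'dbname=db2';

store node (id=2, comment='new node', event node=1);
store path (server=1, client=2, conninfo='dbname=db1');
store path (server=2, client=1, conninfo='dbname=db2');

# Make sure node 2 has caught up with the configuration events
# generated on node 1 before going any further.
wait for event (origin=1, confirmed=2, wait on=1);

subscribe set (id=1, provider=1, receiver=2, forward=no);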

The Problem
----------------
The problem with the above statement is that it is very imprecise.  It 
has never been clearly defined when you need to wait, or against which 
node.  Furthermore, the answer differs depending on the version of 
Slony you are using, because it depends on the internal operation of 
Slony.


An informal survey of the slony mailing list shows that almost no users 
understand how WAIT FOR should be used.

Proposal
------------
The goal is to make slonik handle the proper waiting between events.  If, 
based on the previous slonik commands, the next command needs to 
wait until a previous command is confirmed by another node, then slonik 
should be smart enough to figure this out and wait.

When should slonik wait?

Slonik can connect to any node in the cluster (or at least any node it 
has an admin conninfo for) to submit commands.  If, in a single 
slonik script, slonik submits a command against a node (say node (a)) 
and then later submits a command against a different node (say node (b)), 
then it would be a really good idea to ensure that any event generated 
by the earlier command on node (a) has been processed on node (b) before 
slonik submits a command to node (b).

If slonik were to submit the command to node (b) before the previous 
command reaches (b), you might get different behaviour than if the 
command were submitted to (b) after the previous event reaches (b).
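
As a hypothetical sketch, with node 1 playing the role of node (a) and 
node 2 the role of node (b):

# This generates an event (e1) on node 1 ...
create set (id=1, origin=1, comment='first set');

# Without a wait here, the next command could be processed on node 2
# before e1 has arrived there.
wait for event (origin=1, confirmed=2, wait on=1);

# ... and this generates an event (e2) on node 2.
create set (id=2, origin=2, comment='second set');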

Question: Does slonik have to wait until the event from node (a) has 
been processed on all nodes, or just on node (b)?

To answer this, suppose waiting for just node (b) were enough.  Let us 
have another node (c).  If the event from node (b) (say e2) reaches 
node (c) before the event from node (a) (event e1), do we have a problem?

Consider the example of a user submitting a 'store node' command to 
node (a) to create a new node (d).  Then the user submits a 'drop node' 
command to node (b).  If the drop node is processed on node (c) before 
the store node, then we have an issue.
So yes, this is a problem.

Can this actually happen?

Say node (c) is in the middle of subscribing to a set directly from node 
(a).  The remote worker on (c) for node (a) will not process e1 until 
after the subscription has finished.  However, the remote worker on (c) 
for node (b) will be able to see and process e2 earlier.


So yes: slonik must wait for the event to be confirmed by more nodes 
than just the ones directly involved in the commands.

Part of the problem is that there is no total ordering of events.  Each 
event gets a sequence number with respect to the node the event 
was created on.  There is no direct way for node (c) to sequence 
events from node (a) and events from node (b) in their original order.

What would solve this?

1) If we had a global ordering on events, perhaps assigned by a cluster 
coordinator, node (c) would be able to process the events in the right order.
2) When an event is created on node (b), we could record the fact that 
node (b) has already seen/confirmed event 1234 from node (a) and transmit 
this pre-condition as part of the event, so node (c) knows that it 
can't process the event from (b) until it has seen event 1234 from (a).  This 
way node (c) will process things in the right order, but we can still submit 
events to (b) - which is up to date - without having to wait for the busy 
node (c) to catch up.
3) We could disallow or discourage the use of multiple event nodes and 
require all slonik command events to originate on a single cluster node 
(other than store path and maybe subscribe set), and provide facilities 
for dealing with cases where that event node fails or is split.
4) We could require that the cluster be caught up before using a 
different event node.  This is where we automatically do the WAIT FOR ALL.


The approach proposed here is to go with (4): before switching 
event nodes, slonik will WAIT FOR all nodes to confirm the last event.
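
In terms of the existing explicit syntax, the implicit wait inserted 
before an event-node switch would be roughly equivalent to (assuming the 
previous event originated on node 1):

wait for event (origin=1, confirmed=all, wait on=1);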


Implications/Drawbacks
--------------------------

If a user invokes slonik multiple times, i.e. one slonik invocation per 
command, then this feature provides no protection.

If a node in the cluster is behind, perhaps because it is subscribing to a 
large set or has fallen far behind, then slonik commands need to be 
restricted to a single event node.  We will explore the implications of 
this below.


If a user starts a second slonik instance while the first one is waiting, 
then they have no protection.  This is no worse than today.

What are the implications of having to wait until the cluster is caught 
up before switching event nodes?  The following are cases where you 
might switch event/command nodes.

1) STORE PATH - the event node is dictated by how you are setting up the 
path.  Furthermore, if the backwards path isn't yet set up, the node won't 
receive the confirm message.
2) SUBSCRIBE SET (in 2.0.5+) always gets submitted at the origin.  So if 
you are subscribing multiple sets, slonik will switch event nodes.  This 
means that subscribing to multiple sets (with different set origins) in 
parallel will be harder (you will need to disable automatic wait-for or 
use separate slonik invocations; see the sketch after this list).  You 
can still do parallel subscribes to the same set, because the subscribe 
set always goes to the origin in 2.0.5+, not the provider or the receiver.
3) STORE/DROP LISTEN goes to specific nodes based on the arguments, but 
you shouldn't need STORE/DROP LISTEN commands anyway in 1.2 or 2.0.
4) CREATE/DROP SET must go to the set origin.  If you're creating sets, 
the cluster probably needs to be caught up.
5) SET ADD TABLE/SET ADD SEQUENCE - must go to the origin.  Again, if 
you're manipulating sets you must stick to a single set origin or have 
your cluster be caught up.
6) SET MOVE TABLE goes to the origin - but the docs already warn you about 
trying this if your cluster isn't caught up (with respect to this set).
7) MOVE SET - doing this with a behind cluster is already a bad idea.
8) FAILOVER - see the multi-node failover discussion.
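
To illustrate case 2, consider two sets with different origins being 
subscribed to the same receiver (node and set IDs are hypothetical):

# Set 1 originates on node 1, set 2 originates on node 2.
subscribe set (id=1, provider=1, receiver=3);
# The event node changes from 1 to 2 here, so an implicit WAIT FOR ALL
# would be inserted, serializing the two subscriptions.
subscribe set (id=2, provider=2, receiver=3);

To run the two subscriptions in parallel you would have to disable the 
automatic wait-for (see the syntax proposed below) or run each subscribe 
in its own slonik invocation.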


STORE PATH
-----------
A WAIT FOR ALL nodes won't work unless all of the paths are stored.
When I say 'all' I mean there must exist a route from every node to 
every other node.  The routes don't need to be direct. There are certain 
common usage patterns that shouldn't be excluded. It would be good if 
slonik could detect missing paths before 'changing things' because 
otherwise users might be left with a half complete script.

To do this slonik will:
- Connect to each event node, query sl_path, and build an in-memory 
representation of the cluster configuration.
- Walk the parse tree of the slonik script and find places where the 
event node changes.  At each of these places it will verify that the 
path network is complete.  If the path network is not complete, slonik 
will throw an error.
- Insert an implicit WAIT FOR ALL command into the parse tree at each 
such place.
- Nodes that have not yet been created (but have admin conninfo data) 
will not generate 'incomplete' errors.  This means that STORE NODE, 
DROP NODE and FAILOVER events must be picked up during parse tree 
analysis and used to adjust slonik's internal state.
- STORE PATH/DROP PATH commands must also be picked up by the parse tree 
analysis.
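
As an illustration (node IDs and conninfo strings are hypothetical), the 
analysis would flag a script like this before executing anything:

store node (id=3, comment='new node', event node=1);
store path (server=1, client=3, conninfo='dbname=db1');
# No return path (server=3, client=1) has been stored.

# The event node changes from 1 to 3 here, so slonik would insert an
# implicit WAIT FOR ALL.  But without the return path, node 3's
# confirmations never reach the rest of the cluster, so the wait could
# never complete; slonik reports the missing path at analysis time
# instead of leaving the script half executed.
create set (id=1, origin=3, comment='set on node 3');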

Consider this slonik script:
-------------------------------
store node (id=2, event node=1);
try {
   subscribe set (id=1, provider=1, receiver=2);
   store path (client=1, server=2, ...);
   store path (client=2, server=1, ...);
}
on error {
echo 'let us ignore errors';
}
create set (id=2, origin=2);
-------------------------------------

Is it valid?
If the subscribe set works, then paths will exist for the create set 
(which would require an implicit wait for to precede it).

But if the subscribe set fails, then the paths never get created and it 
is not possible to do a wait for before the create set.

The easy answer is: don't write scripts that can leave your cluster in 
an indeterminate state.  What we should do if someone tries is an open 
question.  We could a) check that all code paths (the cross product) leave 
the cluster consistent/complete, b) assume the try blocks always finish 
successfully, or c) not do the parse tree analysis described above for 
the entire script at parse time, but instead do it for each block before 
entering that block.
I am leaning towards c.
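
For example, the 'easy answer' version of the script above could store 
the paths outside the try block, so the path network is complete whether 
or not the subscription succeeds (conninfo strings elided as in the 
original):

-------------------------------
store node (id=2, event node=1);
store path (client=1, server=2, ...);
store path (client=2, server=1, ...);
try {
   subscribe set (id=1, provider=1, receiver=2);
}
on error {
echo 'let us ignore errors';
}
# The implicit WAIT FOR ALL can safely be inserted here, since the
# paths exist whether or not the subscribe succeeded.
create set (id=2, origin=2);
-------------------------------------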

Subscription Reshaping
--------------------------
A common use-case is
1-->2--->3
and node 2 goes offline.  You might want to create a node 4 (to replace 
node 2) and make node 3 feed off node a until node 4 is caught up then 
have node 3 feed off node 4.  You think node (2) will come back up in a 
while so you don't want to drop it from the cluster.

subscribe set (id=1, provider=1, receiver=3);
store node (id=4, event node=1);
store path (client=4, server=3, ...);
store path (client=3, server=4, ...);
subscribe set (id=1, provider=1, receiver=4);

The above script will work and will not require any implicit WAIT FOR 
commands since the origin for the subscribe sets AND the store node is 
1.  It is also worth noting that the first subscribe set (the reshape) 
will also connect directly to node 3 and reconfigure the node 3 listen 
network to pay attention to events from node 1 directly.

Now consider the case where node 3 is also the origin for a set being 
replicated to node 2 and we now want to replicate it to node 4 as well.


subscribe set (id=1, provider=1, receiver=3);
store node (id=4, event node=1);
store path (client=4, server=3, ...);
store path (client=3, server=4, ...);
subscribe set (id=1, provider=1, receiver=4);
-- An IMPLICIT WAIT FOR will be inserted here
subscribe set (id=2, provider=3, receiver=4);

The above script won't ever complete (until node 2 comes back online) 
because the WAIT FOR ALL will be waiting for the events to be confirmed 
by node 2.

One solution could be to say to users 'tough': execute that second 
subscribe set in another slonik invocation.
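
That is, the first invocation would end after the subscribe set 
(id=1, ...), and a second invocation, run once the first has finished, 
would contain only the following (a sketch; the explicit wait mirrors 
the manual approach discussed below):

# Make sure node 3 has processed node 1's configuration events (the
# store node / store path for node 4) before asking it to feed node 4.
wait for event (origin=1, confirmed=3, wait on=3);
subscribe set (id=2, provider=3, receiver=4);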

How does this situation behave today?  It is pretty important that the 
subscribe set (id=2, ...) does not get processed until the store 
node (id=4) makes it to node 3.  The script as written above might not 
work.  You would need an explicit WAIT FOR before the subscribe set 
(id=2).  A WAIT FOR (confirmed=all) won't ever complete because node 2 
is behind.

If I were manually writing the script I could do
WAIT FOR (id=1, confirmed=3, ...).
The problem today is that when node 2 comes back online it might see 
the SUBSCRIBE SET (id=2, ...) from node 3 BEFORE it learns about node 4.
Or is there some reason I am not seeing why this isn't a problem today?
This also applies to sites that intentionally lag one of their slons by 
a few hours.  Automatic WAIT FOR is incompatible in practice with this 
use case, but I don't see how such sites aren't exceptionally vulnerable 
to race conditions today.


FAILOVER
---------------
Most failover scenarios are not done as part of an existing script but 
are scripts done with the 'failover' in mind.
To that end the failover script might involve adding some paths and then 
executing the failover command which shouldn't require any waiting

Because the STORE PATH for the path from the receiver to the backup node 
(if one is required) will be executed by slonik on the backup node, the 
failover path-checking logic that runs on the backup node will 
always see that path (if the path was created), and no WAIT FOR is required.

Cluster reshaping will typically happen after the failover command, 
because most cluster reshaping involves submitting events to the set 
origin (which is inaccessible before the failover).

After the failover, commands to other nodes would require an implicit 
WAIT FOR ... ALL.  This is fine as long as the 'failed' node is marked as 
disabled so the ALL doesn't include the failed node (this does not 
happen today; to be safe you would have to manually include a wait for 
against each of the individual surviving nodes other than the failover 
target).
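
For example, with surviving nodes 2, 3 and 4, failed node 1 and backup 
node 2, the manual version might look something like this (node IDs are 
hypothetical):

failover (id=1, backup node=2);
# Wait for each surviving node individually; confirmed=all would hang
# because it still includes the failed node 1.
wait for event (origin=2, confirmed=3, wait on=2);
wait for event (origin=2, confirmed=4, wait on=2);
drop node (id=1, event node=2);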

Slonik will need to be modified to recognise that FAILOVER means it 
should exclude the failed node from its list of 'all nodes to wait 
for', just as it will need to do for DROP NODE.


Disabling Automatic WAIT FOR
----------------------------
Automatic WAIT FOR can be disabled in a script by adding a command like:
"SET CONFIRM_DELAY=off"
or turned on with
"SET CONFIRM_DELAY=on"

(or someone should put forward different syntax)
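
With that provisional syntax, the parallel-subscribe case from the 
implications section might then be written as:

SET CONFIRM_DELAY=off;
subscribe set (id=1, provider=1, receiver=3);
subscribe set (id=2, provider=2, receiver=3);
SET CONFIRM_DELAY=on;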

Open Questions
---------------------
- Is the analysis above correct?
- How do we want to handle TRY blocks?  See the discussion above.
- Are we willing to live with the above limitations?



Comments?

