Tue Jan 11 09:52:17 PST 2011
Steve Singer <ssinger at ca.afilias.info> writes:
> This means that IF a) we can somehow avoid race conditions at the
> event nodes AND b) implement the rule "apply all events that were
> visible on the event origin before applying the event" then we can
> eliminate race conditions.
>
> i) I feel (a) needs to be handled by slonik, and can be handled by
> slonik waiting for the next event node to be caught up to previous
> event nodes before submitting an event to it (discussed in a previous
> email on this thread, though I have not formally tried to prove this).
>
> ii) can be implemented by adding a new column to sl_event that stores
> an array of event tuples - which consist of the highest events from
> each node confirmed on the node at the time sl_event is generated.
> The remoteWorker won't apply an event to its local node until these
> pre-conditions have been met.
>
> Questions: What is the performance impact of getting the highest
> confirmed event id values? This is unknown - but will probably involve
> querying sl_event and/or sl_confirm.

That's a very interesting approach; it changes visible behaviour by
evaluating events in a different ordering than may be true today.  But
by establishing a consistent ordering for applying events (e.g. - when
there are events coming in from multiple nodes), it should provide some
consistency that doesn't exist today.

I would expect the cost of finding the highest confirmed events to be
fairly low, albeit increasing when there are a lot of nodes, though a
little benchmark isn't confirming that notably well...
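To make (ii) concrete, here is a rough SQL sketch of what storing and
checking those pre-conditions might look like.  The column name
ev_preconditions and the text-array encoding are purely my illustrative
assumptions, not anything settled in this thread:

    -- Hypothetical new column: per-origin high-water marks captured
    -- at the moment the event is generated on its origin.
    alter table sl_event
        add column ev_preconditions text[];

    -- At event-generation time, record 'origin,seqno' pairs taken from
    -- the summarization of sl_confirm on the event's origin:
    select array_agg(con_origin || ',' || max_seqno)
      from (select con_origin, max(con_seqno) as max_seqno
              from sl_confirm
             group by con_origin) as hwm;

The remoteWorker would then defer applying an event until, for each
stored (origin, seqno) pair, its local node has already confirmed that
origin up to at least that seqno.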
I ran the following a bunch of times on a node to inject ~12K
sl_confirm entries similar to those already existing:

   insert into sl_confirm (con_origin, con_received, con_seqno)
   select con_origin, con_received, max(con_seqno) + 1
     from sl_confirm
    group by con_origin, con_received;

The query that provides the relevant data is the following
summarization of sl_confirm:

slonyregress1@localhost-> select con_origin, max(con_seqno) from sl_confirm group by con_origin;
 con_origin |    max
------------+------------
          1 | 5000005860
          2 | 5000005855
(2 rows)

slonyregress1@localhost-> explain analyze select con_origin, max(con_seqno) from sl_confirm group by con_origin;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=250.96..250.99 rows=2 width=12) (actual time=3.588..3.589 rows=2 loops=1)
   ->  Seq Scan on sl_confirm  (cost=0.00..192.31 rows=11731 width=12) (actual time=0.008..1.001 rows=11712 loops=1)
 Total runtime: 3.619 ms

While it's not ultra-slow, I observe that this isn't helped in the
slightest by the presence of reasonable indices on sl_confirm.  We
could presumably take advantage of an index if we ran a query for each
node searching for that node's maximum confirmed sequence; that should
use one or another of the indices, albeit at the cost of needing a
stored function that evaluates a query per node.  But I'm jumping into
implementation internals here.

> I feel this produces a solution to wait for where slonik only needs to
> wait to make sure the next event node has received all events so far.

The problem with this comes if there is a failure of some "secondary"
node; that could cause downstream nodes to lag on feeding in data from
their providers because there's something broken about the "secondary"
node.  The theory seems sound enough, but if it makes behaviour degrade
on a degraded cluster, that mightn't be a "win."
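A sketch of that per-node variant, assuming an index on (con_origin,
con_seqno) and iterating over sl_node rather than scanning sl_confirm
for distinct origins; the function name is hypothetical:

    -- One indexed probe per node: each inner query can walk the
    -- (con_origin, con_seqno) index backwards instead of forcing a
    -- sequential scan over all of sl_confirm.
    create or replace function highest_confirmed_per_node()
    returns table (origin int4, max_seqno int8) as $$
    declare
        n record;
    begin
        for n in select no_id from sl_node loop
            return query
                select n.no_id,
                       (select con_seqno
                          from sl_confirm
                         where con_origin = n.no_id
                         order by con_seqno desc
                         limit 1);
        end loop;
    end;
    $$ language plpgsql;

Whether the per-node probes actually beat the single HashAggregate will
depend on the size of sl_confirm relative to the number of nodes; for
small clusters the difference is likely in the noise.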
:-(
--
select 'cbbrowne' || '@' || 'afilias.info';
Christopher Browne
"Bother," said Pooh, "Eeyore, ready two photon torpedoes and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"
More information about the Slony1-hackers mailing list