Tim Bowden tim.bowden at westnet.com.au
Mon Sep 10 18:57:08 PDT 2007
On Mon, 2007-09-10 at 10:34 -0700, Andrew Hammond wrote:
> On 9/9/07, Tim Bowden <tim.bowden at westnet.com.au> wrote:
>         On Mon, 2007-09-10 at 10:54 +0800, Tim Bowden wrote:
>         > From the docs: cases where Slony-I probably won't work out
>         > well would include:
>         >
>         >       * Sites where connectivity is really "flakey"
>         >
>         >       * Replication to nodes that are unpredictably connected.
>         >
>         > How flakey/unpredictably connected can nodes be before it all
>         > goes haywire?  Is it time critical, or load critical?
> 
> Yes, and yes.
> 
> 
>         > If an origin node goes offline for a day, but there are only a
>         > couple of transactions, will that be a problem?  If an origin
>         > node goes offline for a few minutes but there are hundreds of
>         > transactions, what does the recovery scenario look like?
> 
> Assuming you run each slon either on the same box as the database
> it's supporting or in the same LAN, this is probably survivable. You
> will need to restart your slons every time the network status for
> _any_ of your databases changes. And yes, detecting and handling this
> correctly is likely to get complicated. 

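If I follow, that means something like a per-node watchdog that bounces
the local slon whenever any peer link changes state, along these lines
(purely a sketch; the cluster name, conninfo strings and the poll
interval are placeholders of mine):

# Sketch only: restart the local slon whenever connectivity to any peer
# database changes state.
import subprocess
import time
import psycopg2

CLUSTER = "mycluster"                    # placeholder cluster name
LOCAL_CONNINFO = "dbname=mydb host=localhost"
PEER_CONNINFOS = [
    "dbname=mydb host=remote1 connect_timeout=5",
    "dbname=mydb host=remote2 connect_timeout=5",
]

def peer_state():
    # One True/False flag per peer we can currently reach.
    flags = []
    for conninfo in PEER_CONNINFOS:
        try:
            psycopg2.connect(conninfo).close()
            flags.append(True)
        except psycopg2.OperationalError:
            flags.append(False)
    return tuple(flags)

def start_slon():
    # "slon <cluster> <conninfo>" is the usual way to start the daemon.
    return subprocess.Popen(["slon", CLUSTER, LOCAL_CONNINFO])

slon = start_slon()
state = peer_state()
while True:
    time.sleep(30)
    new_state = peer_state()
    if new_state != state:               # a link came up or went down
        slon.terminate()
        slon.wait()
        slon = start_slon()
        state = new_state
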
Way too complicated for the environment I'm looking at.

> 
> 
> Your slons will generate transactions regularly on all nodes in the
> form of SYNCs. These need to be propagated between all nodes and then
> applied. Once SYNCs (and other events) have been applied, confirmation
> messages are propagated between all nodes. Once all nodes have applied
> events, then the cleanup thread on each node can remove the
> information necessary for confirmed events. 
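
That helps.  So in principle I can watch how far behind each node is by
comparing sl_event against sl_confirm in the cluster schema, something
like this (just a sketch; the _mycluster schema name is made up and the
table/column names are my reading of the Slony-I catalogs, so worth
double-checking):

# Sketch: for each origin, how many events has each receiver not yet
# confirmed?  Assumes a cluster schema named _mycluster.
import psycopg2

conn = psycopg2.connect("dbname=mydb host=localhost")
cur = conn.cursor()
cur.execute("""
    SELECT e.ev_origin,
           c.con_received,
           max(e.ev_seqno) - max(c.con_seqno) AS events_behind
      FROM _mycluster.sl_event e
      JOIN _mycluster.sl_confirm c ON c.con_origin = e.ev_origin
     GROUP BY e.ev_origin, c.con_received
""")
for origin, receiver, behind in cur.fetchall():
    print("origin %d -> node %d: %d events unconfirmed"
          % (origin, receiver, behind))
cur.close()
conn.close()
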
> 
> For the cases you mention above, the obvious failure scenarios are as
> follows.
> 1) Network failures at a rate where a slon can not process all the
> items in some event before getting reset. 
> 2) Any one node being down long enough to grow sl_log_n to the point
> that it enters the "death spiral" (becomes so large that maintenance
> costs cause it to grow faster than it can be consumed). 
> 

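On point 2), I guess I can at least keep an eye on how big the log
tables are getting on each node with something like this (again a
sketch; the _mycluster schema name is an assumption, and the table list
needs adjusting to whichever sl_log_n tables the installed version has):

# Sketch: report row counts and on-disk size of the Slony-I log tables
# on one node, so runaway growth shows up early.
import psycopg2

conn = psycopg2.connect("dbname=mydb host=localhost")
cur = conn.cursor()
for table in ("sl_log_1", "sl_log_2"):
    cur.execute("SELECT pg_relation_size('_mycluster.%s')" % table)
    size_bytes = cur.fetchone()[0]
    cur.execute("SELECT count(*) FROM _mycluster.%s" % table)
    rows = cur.fetchone()[0]
    print("%s: %d rows, %.1f MB" % (table, rows, size_bytes / 1048576.0))
cur.close()
conn.close()
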
Given 100+ nodes, each on a different LAN, I think it's safe to assume we
will have significant network issues.  In this case we need to use log
shipping by default and standard replication only within a well-connected
core group of nodes.

Given the need for each node to do standard replication to at least one
other node, I'll set up each remote node to replicate to itself (I believe
this is possible) so it can then log ship to the central slave.
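
For the log shipping side my rough plan is just to apply the archive
files in order at the central end, something like this (sketch only;
the paths and database name are placeholders, and I'm assuming the
archive directory ends up holding ordered .sql files that psql can
apply):

# Sketch: apply log-shipping archive files to the central slave in
# filename order, remembering which ones have already been applied.
import glob
import os
import subprocess

ARCHIVE_DIR = "/var/lib/slony/archive"      # where the remote node ships to
APPLIED_LOG = "/var/lib/slony/applied.txt"  # simple record of applied files

applied = set()
if os.path.exists(APPLIED_LOG):
    applied = set(open(APPLIED_LOG).read().split())

for path in sorted(glob.glob(os.path.join(ARCHIVE_DIR, "*.sql"))):
    name = os.path.basename(path)
    if name in applied:
        continue
    # Stop on the first error rather than skip a file; the archives have
    # to be applied strictly in order.  ON_ERROR_STOP makes psql exit
    # non-zero if any statement fails.
    rc = subprocess.call(["psql", "-v", "ON_ERROR_STOP=1",
                          "-f", path, "centraldb"])
    if rc != 0:
        raise SystemExit("failed applying %s" % name)
    with open(APPLIED_LOG, "a") as f:
        f.write(name + "\n")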


> To quote Jan's concept paper (which you really ought to read before
> going further in this discussion:
> http://developer.postgresql.org/~wieck/slony1/Slony-I-concept.pdf),
> "Neither offline nodes that only become available for sporadic
> synchronization (the salesman on the road) nor ... will be
> supported..."
> 

Thanks; the more reading I do at this stage, the better.

> 
>         As a follow up, I noticed a post from a week or so ago that
>         mentioned bouncing nodes between standard replication and
>         updating by log shipping, but said it wasn't currently a viable
>         solution.  Is this likely to ever become viable, as it would
>         solve the problem of unpredictable network links (at least for
>         some use cases)?
> 
> That sounds kinda complicated. Has anyone written a proposal for how
> to do it yet? It's taken us almost 2 years to get log-shipping to the
> point where it seems seriously viable. The project has existed for
> something approaching 4 years... 

The more I learn the more I see why this is so complicated.  I won't
count on this feature.

> 
> Andrew
> 
Thanks,
Tim Bowden


