[Slony1-general] Catching up a large backlog: a few observations

Thu Apr 24 09:43:55 PDT 2008

Hi,

At 17:54 24/04/2008, Christopher Browne wrote:
>I'm not sure we want to stop queueing events altogether for any
>extended period of time.  *That* seems like a risky thing to do.

I'm not quite sure why? I'd rather have them stay untouched in the DB 
than have slon grow (potentially a lot) for no good reason.

>- If the problem is that there is a backlog of SYNCs that *will* need
>   to be processed, then metering them in, via "LIMIT N" + some delays,
>   should prevent the slon from blowing up.  If it *does* blow up, it'll
>   restart, *hopefully* after getting some work done.

It does indeed, but in the meantime you have requested lots of events 
that ended up not being used (and which you will fetch again on the 
next run), and you have used memory that could be more useful as OS 
cache. In many cases you also end up stopping slon quite abruptly 
while it's fetching data, with postgres continuing to work on that 
fetch while you start a new one. And really, the "oh, it will crash 
and restart anyway" approach gives me goose bumps for something 
related to DB replication!

>The alternative solution is to do "strict metering" where we don't
>allow the queue to grow past some pre-defined size.  But I'm not sure
>what that size should be.

n x sync_group_maxsize? With n somewhere between 2 and 10, I'd say.
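
To make the "strict metering" idea concrete, here is a rough sketch in 
Python rather than slon's C. All names (queue_cap, refill, the deque 
standing in for the event table) are hypothetical and not taken from 
the slon source, and the default n = 4 is just an arbitrary pick from 
the 2..10 range suggested above. The point is only that events beyond 
the cap stay in the DB until the worker has drained some of the queue.

```python
from collections import deque


def queue_cap(sync_group_maxsize, n=4):
    """Upper bound on in-memory events: n x sync_group_maxsize.

    n is an assumed tuning knob; the mail only suggests "between 2 and 10".
    """
    return n * sync_group_maxsize


def refill(queue, pending, cap):
    """Move events from `pending` into `queue` without exceeding `cap`.

    `pending` stands in for events still sitting in the database;
    anything that does not fit is simply left there for the next call.
    Returns the number of events moved.
    """
    room = max(0, cap - len(queue))
    moved = 0
    while moved < room and pending:
        queue.append(pending.popleft())
        moved += 1
    return moved
```

With sync_group_maxsize = 20 and n = 4, a backlog of 1000 events would 
be pulled in at most 80 at a time, and each later refill would only 
top up what the worker has consumed since.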

>Ah, you could be right there.  Yes, it may be that the "time for first
>fetch" is nearly constant, and so should be taken out of the estimate.

It's at least somewhat constant for periods of time, when the index 
isn't selective enough and the initial fetch needs a lot of work. 
Over longer periods it does vary quite significantly.
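
As an illustration of taking the first-fetch time out of the estimate, 
here is a small Python sketch. It is not the calculation slon actually 
performs; the function name and parameters are made up for this 
example, and it assumes the first-fetch cost can be measured 
separately from the rest of the group's processing time. It also 
assumes the fixed cost should be subtracted from the time budget as 
well as from the measured duration.

```python
def next_group_size(last_group_size, last_total_ms, first_fetch_ms,
                    desired_sync_time_ms, sync_group_maxsize):
    """Estimate how many SYNCs to group next (hypothetical helper).

    The first fetch is treated as a fixed overhead: only the remaining
    time is assumed to scale with the number of SYNCs in the group.
    """
    # Time that actually scaled with the group size last time round.
    variable_ms = max(last_total_ms - first_fetch_ms, 1)
    per_sync_ms = variable_ms / last_group_size

    # The fixed first-fetch cost is paid once whatever the group size,
    # so it comes out of the budget too (an assumption of this sketch).
    budget_ms = max(desired_sync_time_ms - first_fetch_ms, 0)

    size = int(budget_ms / per_sync_ms)
    # Always process at least one SYNC, never more than the maximum.
    return max(1, min(size, sync_group_maxsize))
```

For example, if the last group of 10 SYNCs took 6000 ms of which 4000 
ms was the first fetch, each SYNC cost about 200 ms rather than the 
600 ms a naive total/size division would suggest, so with a 10000 ms 
target the group could grow to 30 instead of 16.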

>Mind you, we may be "gilding buggy whips" here; trying to improve an
>estimate that is fundamentally flawed.  There is the fundamental flaw
>that there is no real reason to expect two SYNCs to be equally
>expensive, if there is variation in system load.

Over consecutive runs I would expect them to be quite consistent, 
there would just be an issue at the point where the load changes 
(start or end of a batch job, etc.). Obviously for a system with lots 
of short spikes and low values of desired_sync_time it would not make 
much sense, but then I'm not sure the desired_sync_time would make 
much sense either?

Jacques.