[Slony1-general] server crash on slave: slon-start to do catch-up VS re-subscribe

Wed Dec 1 10:42:02 PST 2010

Vick Khera <vivek at khera.org> writes:
> On Wed, Dec 1, 2010 at 9:33 AM, Mark Steben <msteben at autorevenue.com> wrote:
>>  Should I:
>>  1. issue a ./slon_start and allow slony to catch-up two weeks worth of
>>     updates
>>        Or
>>  2. Punt, start from scratch, redefine everything and rerun the 16 hour
>>     subscription process?
>
> It all depends on the rate of change in your database. Do you do a lot
> of insert/update/deletes to the DB?  Take a look at the number of rows
> in the sl_log_1 and sl_log_2 tables in your replication schema on the
> master.  If the number is very high, say in the tens or hundreds
> of millions, then perhaps restarting from scratch may be helpful.  If
> it is in the low tens of millions, I'd venture to say you could
> recover by just restarting slony.  One optimization may be to drop all
> indexes (except the PK index) on the replica until it is caught up.
> This will reduce the I/O it needs to apply the changes.

It's uncertain what the bottleneck will be; that may well depend on
local characteristics, so it may be a mistake to assume that saving
on index writes is material.

I think something would be learned by simply letting Slony catch up.
There are some interesting open questions as to the pathologies when
there's truly a lot of data in sl_log_1/2.

I don't imagine it would take too terribly long to figure out if things
are catching up, between:

  a) Watching the sl_status view to verify that the lag times are
     falling, and

  b) Grepping the subscriber's logs for the following log lines:

  slon_log(SLON_DEBUG1, "remoteHelperThread_%d_%d: inserts=%d updates=%d deletes=%d truncates=%d\n",
           node->no_id, provider->no_id,
           pm.num_inserts, pm.num_updates,
           pm.num_deletes, pm.num_truncates);

  slon_log(SLON_INFO, "remoteWorkerThread_%d: SYNC "
           INT64_FORMAT " done in %.3f seconds\n",
           node->no_id, event->ev_seqno,
           TIMEVAL_DIFF(&tv_start, &tv_now));
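
If plain grep output is too noisy to eyeball, a small script can total
things up. What follows is only a rough sketch in Python: the two
regular expressions are derived from the format strings quoted above,
while the function name summarize() and the sample log lines
(timestamps, node numbers, counts) are made up for illustration. It
uses search() rather than match() because the exact prefix slon puts
in front of each message depends on the local logging setup.

```python
import re

# Patterns derived from the two slon_log() format strings quoted above.
SYNC_RE = re.compile(
    r"remoteWorkerThread_(\d+): SYNC (\d+) done in (\d+\.\d+) seconds"
)
COUNTS_RE = re.compile(
    r"remoteHelperThread_(\d+)_(\d+): "
    r"inserts=(\d+) updates=(\d+) deletes=(\d+) truncates=(\d+)"
)


def summarize(lines):
    """Total up SYNC and row-count messages found in slon log lines."""
    syncs = 0
    sync_seconds = 0.0
    last_seqno = None
    rows_applied = 0
    for line in lines:
        m = SYNC_RE.search(line)
        if m:
            syncs += 1
            last_seqno = int(m.group(2))
            sync_seconds += float(m.group(3))
            continue
        m = COUNTS_RE.search(line)
        if m:
            # Groups 3..6 are the insert/update/delete/truncate counts.
            rows_applied += sum(int(g) for g in m.groups()[2:])
    return {
        "syncs": syncs,
        "last_seqno": last_seqno,
        "sync_seconds": sync_seconds,
        "rows_applied": rows_applied,
    }


# Made-up sample lines, for illustration only.
SAMPLE = [
    "2010-12-01 10:00:01 PST DEBUG1 remoteHelperThread_2_1: inserts=120 updates=30 deletes=5 truncates=0",
    "2010-12-01 10:00:02 PST INFO remoteWorkerThread_1: SYNC 5000100 done in 1.250 seconds",
    "2010-12-01 10:00:03 PST DEBUG1 remoteHelperThread_2_1: inserts=10 updates=0 deletes=0 truncates=0",
    "2010-12-01 10:00:04 PST INFO remoteWorkerThread_1: SYNC 5000110 done in 0.750 seconds",
    "2010-12-01 10:00:05 PST CONFIG some unrelated line",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running it over successive chunks of the log and comparing last_seqno
with the origin's latest event number gives a rough idea of how fast
the gap is closing.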

If, after a few hours, things aren't catching up, it should be easy
enough to drop and resubscribe.
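
To put a number on "catching up", one could sample st_lag_num_events
from sl_status a couple of times and extrapolate. The sketch below
(Python, with no database access) assumes the samples have been
collected by hand as (seconds, lag) pairs; the function name
catch_up_estimate() and the sample figures are hypothetical, and the
straight-line extrapolation is a crude assumption, since the apply
rate is unlikely to stay constant as the backlog shrinks.

```python
def catch_up_estimate(samples):
    """Estimate seconds until the lag reaches zero.

    samples: list of (seconds_since_first_sample, st_lag_num_events)
    pairs, oldest first. Returns None if the lag is not shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    # Events of backlog cleared per second between first and last sample.
    rate = (lag0 - lag1) / elapsed
    if rate <= 0:
        return None  # lag is flat or growing: not catching up
    return lag1 / rate


if __name__ == "__main__":
    # Hypothetical figures: lag fell from 120000 to 102000 events in an hour.
    eta = catch_up_estimate([(0, 120000), (3600, 102000)])
    print("estimated hours remaining:", eta / 3600)
```

A None result after a few hours of sampling would be the signal to
drop and resubscribe.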

Something we want to do (and Jan has on his todo list) is to see what
pathologies fall out for the query that pulls sl_log_* data.  That'll
involve adding some extra logging, such as submitting an "EXPLAIN"
against the relevant query to see how pricey it appears to be.
-- 
"cbbrowne","@","afilias.info"
Christopher Browne
"Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"