Michael Weber dr.michi at gmail.com
Thu Jan 22 15:26:14 PST 2009
Christopher,

thanks a lot for the comments,

test_slony_state did not report anything out of the ordinary when run
against the master; against the slave nodes it reported some missing
listen paths from node 3 to node 1. There were no ERRORs in the
logfiles of node2 & node3, but there were also no INSERTs with a row
count higher than zero (the database is usually not very busy; during
the day it is about one row every 5 minutes). Still, sl_log_2 held
many rows and kept growing.
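
For reference, this is roughly how I was poking at the backlog (a
sketch only: the cluster name "_mycluster" and the connection
settings are placeholders, not our real ones):

# Sketch: "_mycluster", host/db/user are placeholders.
# Rows still queued in the current log table on the origin:
psql -h master -U slony -d mydb -c \
  "SELECT count(*) FROM _mycluster.sl_log_2;"

# Per-subscriber lag as seen from the origin (the sl_status view
# is only meaningful when queried on the origin node):
psql -h master -U slony -d mydb -c \
  "SELECT st_received, st_lag_num_events, st_lag_time
     FROM _mycluster.sl_status;"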

Unfortunately I got impatient with the problem, dropped the slony
schemas, truncated the replicated tables, and recreated & restarted
slony (a piece of cake thanks to the nice slonik_scripts). All of
this took 10 minutes, although I would have liked to find the actual
cause.
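
For the record, the teardown was essentially the following (again
only a sketch with placeholder names; the real work is done by the
generated slonik_scripts):

# Sketch of the teardown, placeholder conninfo and names.
slonik <<_EOF_
cluster name = mycluster;
node 1 admin conninfo = 'host=master dbname=mydb user=slony';
node 2 admin conninfo = 'host=node2 dbname=mydb user=slony';
node 3 admin conninfo = 'host=node3 dbname=mydb user=slony';
uninstall node (id = 3);
uninstall node (id = 2);
uninstall node (id = 1);
_EOF_

# afterwards: TRUNCATE the replicated tables on the subscribers,
# re-run the store node / subscribe set scripts, restart the slons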

I think I messed up the system last week by trying to get replication
started on node2 while node3 was not up to date (version-wise), not
knowing that node3 was running out of disk space.

Thanks again for your help, and sorry for not reading enough before
asking stupid questions (I only read about the test script yesterday,
and I had never looked into the sl_* tables before yesterday either).

About the setup:
I have 2 sets: the master of set1 is on Tenerife, and node2 and node3
are two machines here in Potsdam (Germany). The master of set2 is
node2, with node3 as its copy (it holds data created in Potsdam from
the Tenerife data). That is why I pull the data for node3 from node2
(a tip I got on this list).
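
In slonik terms the subscriptions look roughly like this (a sketch;
set/node ids and conninfo are simplified, and forward = yes on node 2
is what lets it act as provider for node 3):

# Sketch of the subscription layout, simplified ids and conninfo.
slonik <<_EOF_
cluster name = mycluster;
node 1 admin conninfo = 'host=tenerife dbname=mydb user=slony';
node 2 admin conninfo = 'host=node2 dbname=mydb user=slony';
node 3 admin conninfo = 'host=node3 dbname=mydb user=slony';

# set 1 originates on node 1 (Tenerife); node 3 pulls it via node 2
subscribe set (id = 1, provider = 1, receiver = 2, forward = yes);
subscribe set (id = 1, provider = 2, receiver = 3, forward = no);

# set 2 originates on node 2; node 3 is its copy
subscribe set (id = 2, provider = 2, receiver = 3, forward = no);
_EOF_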

Another question: we have mostly metadata in our database; the "real"
data is kept in binary files transported via rsync (8 to 32 MB
uncompressed, compressing to about half that). Would postgresql &
slony also handle those efficiently?
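
What I have in mind is something like a bytea column per file, i.e.
(sketch only, table and column names made up):

# Sketch: table/column names are made up; large bytea values get
# TOASTed (and compressed) by postgres automatically.
psql -h master -U slony -d mydb <<_EOF_
CREATE TABLE data_blob (
    id         serial PRIMARY KEY,   -- slony needs a primary key
    file_name  text NOT NULL,
    payload    bytea NOT NULL        -- the 8-32 MB binary content
);
_EOF_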

Michael
>
> The information in the origin's log tends not to be terribly
> interesting, as the only work it does is to run SYNC events every so
> often.  The slon for that node doesn't do any real replication work.
>
> The question I always ask, at this point, is "what was the output of
> test_slony_state???"
>
> It is a pretty longstanding "best practice" to run that fairly
> frequently (I ask that our DBAs run it against all our clusters on an
> hourly basis), as it represents a very good "early warning" test for a
> number of sorts of misconfiguration that have historically caused
> people problems.
>
> There are a number of ways in which nodes 2 and 3 might be behind, and
> I haven't read anything to distinguish what the cause might be.
>
> - Supposing the disk space outage caused the slons not to run (e.g. -
>  all slons were running on the same host as slave3), then the
>  subscribers could be working their way through one Really Giant SYNC.
>
>  There is a way to avoid this, namely to run generate_syncs.sh
>  reasonably regularly against the origin.
>
> - Perhaps some configuration problem is causing nodes 2/3 to fail to
>  pull data from node 1.
>
> - Supposing the arrangement is 1 --> 2 --> 3, that is,
>    node 2 subscribes to 1, and node 3 subscribes to 2,
>  then there *might* be some benefit to resubscribing node 3 directly
>  to #1.
>
> - It is not evident whether the problem is that:
>
>  a) nodes 2 and 3 are doing work, but just not catching up quickly, or
>
>  b) nodes 2 and 3 are "stuck" somewhere, and aren't progressing.
>
> - I would anticipate the most interesting logs to be those for node
>  #2.
>
>  Particularly interesting would be any error messages.  Grep for
>  "ERROR" :-).
> --
> "cbbrowne","@","linuxfinances.info"
> http://cbbrowne.com/info/slony.html
> "X is like pavement:  once you figure out how to lay  it on the ground
> and  paint yellow  lines on  it, there  isn't much  left to  say about
> it. The exciting new developments  are happening in things that run ON
> TOP of  the pavement, like  cars, bicycles, trucks,  and motorcycles."
> -- Eugene O'Neil <eugene at cs.umb.edu>
>

