Robert Littlejohn Robert.Littlejohn
Tue Jan 24 12:23:11 PST 2006
Great, thanks for the info. I've been meaning to get back to this but I've
been out of the office a bit.

I had already been through the FAQ (that's why I did the vacuums) but so far
I still can't find anything.  The query you posted returned no results.  I'm
looking at test_slony_state-dbi.pl now but so far it only tells me
sl_seqlog, sl_log_1, sl_seqlog all exceed 200000.  Some of the tests fail
with Perl errors so I'll try to get those tests run.

I've also looked quite a bit at the logs and have found nothing.  The
cleanupThread is starting every 5 - 15 minutes and reports things like:

2006-01-06 12:25:47 AST DEBUG1 cleanupThread:    5.849 seconds for cleanupEvent()
2006-01-06 12:26:20 AST DEBUG3 cleanupThread: minxid: 199383042
2006-01-06 12:26:20 AST DEBUG4 cleanupThread: xid 199383042 still active - analyze instead
2006-01-06 12:39:49 AST DEBUG1 cleanupThread:    3.261 seconds for cleanupEvent()
2006-01-06 12:40:08 AST DEBUG1 cleanupThread:   18.638 seconds for delete logs

Except for the "xid 199383042 still active - analyze instead" nothing really
jumps out at me.

I'll keep looking as time permits.

Thanks again,
Robert

-----Original Message-----
From: Christopher Browne [mailto:cbbrowne at ca.afilias.info]
Sent: Thursday, January 19, 2006 5:31 PM
To: Robert Littlejohn
Cc: 'slony1-general at gborg.postgresql.org'
Subject: Re: [Slony1-general] sl_log_1 filling


Robert Littlejohn wrote:

>Hello,
>I have a simple master to one slave cluster setup.  My setup is slony1-1.1.0
>with postgresql-7.3.9-2 on a Red Hat ES v3 server (master and slave are
>the same).
>
>This worked great for several months even with frequent network outages.
>Then in Nov the router at the slave site was changed and the replication
>stopped but nobody noticed.  Now several months later I have over 5 million
>entries in sl_log_1 and over 400,000 in sl_event.  I see no errors in the
>logs, I've vacuumed all tables including the sl_* and pg_* tables - no
>change.  I've restarted both servers - I've viewed the network connections
>and all are good.  The slon processes run fine and the postgres processes
>are talking to each other but still no replication.
>
>small example from slave log:
>2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953791
>2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 2,953791 SYNC
>2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953792
>2006-01-19 16:50:12 AST DEBUG2 localListenThread: Received event 2,953792 SYNC
>2006-01-19 16:50:22 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953793
>
>and from the master:
>2006-01-19 16:50:02 AST DEBUG2 remoteListenThread_2: queue event 2,953790 SYNC
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2: Received event 2,953790 SYNC
>2006-01-19 16:50:02 AST DEBUG3 calc sync size - last time: 1 last length: 9881 ideal: 6 proposed size: 2
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2: SYNC 953790 processing
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2: no sets need syncing for this event
>2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 7342475 - SYNC 1037985
>2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 1,1037985 SYNC
>2006-01-19 16:50:06 AST DEBUG2 syncThread: new sl_action_seq 7342477 - SYNC 1037986
>2006-01-19 16:50:07 AST DEBUG2 localListenThread: Received event 1,1037986 SYNC
>2006-01-19 16:50:10 AST DEBUG2 syncThread: new sl_action_seq 7342479 - SYNC 1037987
>2006-01-19 16:50:11 AST DEBUG2 localListenThread: Received event 1,1037987 SYNC
>2006-01-19 16:50:12 AST DEBUG2 remoteListenThread_2: queue event 2,953791 SYNC
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2: Received event 2,953791 SYNC
>2006-01-19 16:50:12 AST DEBUG3 calc sync size - last time: 1 last length: 10001 ideal: 5 proposed size: 2
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2: SYNC 953791 processing
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2: no sets need syncing for this event
>2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 7342482 - SYNC 1037988
>2006-01-19 16:50:13 AST DEBUG2 localListenThread: Received event 1,1037988 SYNC
>2006-01-19 16:50:22 AST DEBUG2 remoteListenThread_2: queue event 2,953792 SYNC
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2: Received event 2,953792 SYNC
>2006-01-19 16:50:22 AST DEBUG3 calc sync size - last time: 1 last length: 10071 ideal: 5 proposed size: 2
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2: SYNC 953792 processing
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2: no sets need syncing for this event
>
>Any help anybody can give me will be appreciated.  BTW if this is the wrong
>mailing list could someone point me at the correct location for slony1
>problems?
>  
>
This is by all means the right list.

Take a look at the FAQ...  http://linuxfinances.info/info/faq.html

This is consistent with the cleanup thread not cleaning things out.

It's evident that you're running 1.1 (from the ideal/proposed SYNC group
sizes in the logs)...

There used to be common problems where dropping nodes would lead to
confirmations not getting, well, confirmed, and that would prevent
things from clearing out.  That has been fixed for a long time;
nonetheless, problems with event propagation can lead to the cleanup
thread deciding that it can't yet clean out data.

Try the following query (change the namespace appropriately!):

select * from _oxrsbar.sl_confirm
 where con_origin not in (select no_id from _oxrsbar.sl_node)
    or con_received not in (select no_id from _oxrsbar.sl_node);

You might also try running (find these in the "tools" directory) either
test_slony_state-dbi.pl or test_slony_state.pl.  You point either script at
one of the nodes, and it then rummages through the configuration looking
for a number of possible problem conditions.

If you're not keen on Perl, you can look in test_slony_state*.pl to see
what queries they are issuing to look at the state of things.  That may
help you to tell where the problem lies.
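For instance (just a sketch - substitute your own cluster namespace for
_oxrsbar, and note the exact checks the scripts run may differ by version),
queries along these lines will show whether old data is piling up and how
far confirmations lag behind events:

select count(*) from _oxrsbar.sl_log_1;

select con_origin, con_received, max(con_seqno)
  from _oxrsbar.sl_confirm
 group by con_origin, con_received;

select ev_origin, min(ev_seqno), max(ev_seqno)
  from _oxrsbar.sl_event
 group by ev_origin;

If the max confirmed seqno for some (origin, receiver) pair sits far behind
the max event seqno for that origin, that pair is the one to investigate.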

You might also take a peek at those log files and look for the cleanup
thread to see what kind of work it's doing.  Chances are that there's
some long outstanding missing event confirmation or such; as soon as it
gets dealt with, sl_log_1 will get cleaned out (over the next 10
minutes; that thread wakes up only periodically...)
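One way to spot such a stale confirmation (again, just an illustrative
query - substitute your own namespace) is to look at when each node pair
last confirmed anything:

select con_origin, con_received,
       max(con_seqno)     as last_confirmed,
       max(con_timestamp) as last_confirm_time
  from _oxrsbar.sl_confirm
 group by con_origin, con_received
 order by last_confirm_time;

A pair whose last_confirm_time is stuck back in November would explain why
the cleanup thread refuses to delete anything newer than that.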


