Wed Feb 6 08:50:05 PST 2008
- Previous message: [Slony1-bugs] [Bug 9] Revise documentation to indicate new trigger handling
- Next message: [Slony1-bugs] [Slony] vacuum locks all tables?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
http://www.slony.info/bugzilla/show_bug.cgi?id=32

Christopher Browne <cbbrowne at ca.afilias.info> changed:

           What           |Removed                     |Added
----------------------------------------------------------------------------
           Status         |NEW                         |ASSIGNED

--- Comment #3 from Christopher Browne <cbbrowne at ca.afilias.info> 2008-02-06 08:50:05 ---

Can I get a bit more information as to the surrounding activity? I'll ask about some specifics presently, though I'll start by elaborating on what is happening (which may actually answer some of the questions I'm anticipating ;-))...

This seems a bit like a scenario I have seen before, where a node has been offline for a period of time, with the result that an enormous number of SYNC events are outstanding, and it takes more than 5 minutes (300s, corresponding to the value of remote_listen_timeout) to run the query that pulls the outstanding events. It is definitely *not* the same scenario, though:

- You list only 126 events in the sample, which, in this context, is not "enormous." "Enormous" would be thousands to tens of thousands of events...

- The above scenario takes place in remote_listen.c; the problem you are observing takes place in remote_worker.c, and, based on the inputs, I can trace it exactly to the first use of query_append_event() in the main remote worker thread.

Looking at the code: after processing SYNCs (if we have been grouping, there may be a whole bunch processed together), query_append_event() is called for each processed SYNC, and it does two things:

- It copies the event information from the remote node into the local node's sl_event table.

- It creates sl_confirm entries on the local node to indicate that the event was processed locally.

- Actually, three things: it also generates NOTIFY statements to cry out to other nodes: "Extra, extra, read all about it! Events processed on this node!"

I see something excessive there; we generate a boatload of NOTIFY requests even though there will never be a need for more than 2.
That cries out for optimization, to suppress the excess NOTIFY requests.

I thought I would have to ask more about the SYNCs being processed, but closer examination shows that you have listed everything. This example was processing SYNCs 2214870 thru 2214932, a total of 63 SYNCs grouped together. That could represent quite a lot of data, perhaps, but there is no locally relevant reason for this to time out. In terms of "local stuff going on at the time," it is doing 126 NOTIFY requests and 126 INSERTs, and while 124 of the NOTIFYs are redundant, there is nothing overly heinous about that.

I don't see a Slony-I reason for the timeout:

  "PGRES_FATAL_ERROR could not receive data from server: Connection timed out"

My next best guess is that these large sets of SYNCs are processing a lot of data, and that this is "tickling" some problem with the network infrastructure.

Now, after this failure, the slon should drop down to doing just 1 SYNC, and then start doubling the number of SYNCs again, presumably until it hits the network issue again. You might want to decrease the config parameter "sync_group_maxsize" (either on the slon command line or in the slon.conf file) so that you don't tempt the network problem.

As a CVS HEAD item, I'm going to see about cutting down on the number of NOTIFY requests, but I don't think that's relevant to your problem.

--
Configure bugmail: http://www.slony.info/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are the assignee for the bug.
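The sync_group_maxsize suggestion from the comment above can be illustrated with a slon.conf-style fragment; the value shown is purely an example, not a recommendation, so check your version's slon documentation for the valid range and default:

```
# slon.conf fragment -- illustrative only; 10 is an example value.
# Smaller SYNC groups move less data per transaction, which may
# avoid tripping the suspected network problem.
sync_group_maxsize = 10
```

The same setting can, if memory serves, also be given on the slon command line (via its group-size option) rather than in slon.conf.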
More information about the Slony1-bugs mailing list