Wed Feb 6 08:50:05 PST 2008
- Previous message: [Slony1-bugs] [Bug 9] Revise documentation to indicate new trigger handling
- Next message: [Slony1-bugs] [Slony] vacuum locks all tables?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
http://www.slony.info/bugzilla/show_bug.cgi?id=32

Christopher Browne <cbbrowne at ca.afilias.info> changed:

           What           |Removed                     |Added
----------------------------------------------------------------------------
           Status         |NEW                         |ASSIGNED

--- Comment #3 from Christopher Browne <cbbrowne at ca.afilias.info> 2008-02-06 08:50:05 ---

Can I get a bit more information as to the surrounding activity? I'll ask about some specifics presently, though I'll start by elaborating on what is happening (which may actually answer some of the questions I'm anticipating ;-))...

This seems a bit like a scenario I have seen before, where a node has been offline for a period of time, with the result that an enormous number of SYNC events are outstanding, and it takes more than 5 minutes (300s, corresponding to the value of remote_listen_timeout) to run the query that pulls the outstanding events. It is definitely *not* the same scenario, though:

- You list only 126 events in the sample, which, in this context, is not "enormous." "Enormous" would be thousands to tens of thousands of events...

- The above scenario takes place in remote_listen.c; the problem you are observing takes place in remote_worker.c, and, based on the inputs, I can trace it exactly to the first use of query_append_event() in the main remote worker thread.

Looking at the code: after processing SYNCs (if we have been grouping, there may be a whole bunch processed together), query_append_event() is called for each processed SYNC, and it does two things:

- It copies the event information from the remote node into the local node's sl_event table.

- It creates sl_confirm entries on the local node to indicate that the event was processed locally.

- Actually, three things: it also generates NOTIFY statements to cry out to other nodes: "Extra, extra, read all about it! Events processed on this node!"

I see something excessive there; we generate a boatload of NOTIFY requests even though there will never be a need for more than 2.
That cries out for optimization, to suppress the excess NOTIFY requests.

I thought I would have to ask more about the SYNCs being processed, but closer examination shows that you have listed everything. This example was processing SYNCs 2214870 thru 2214932, a total of 63 SYNCs grouped together. That could represent quite a lot of data, perhaps, but there is no locally relevant reason for this to time out. In terms of "local stuff going on at the time," it is doing 126 NOTIFY requests and 126 INSERTs, and while 124 of the NOTIFYs are redundant, there is nothing overly heinous about that.

I don't see a Slony-I reason for the timeout:

  "PGRES_FATAL_ERROR could not receive data from server: Connection timed out"

My next best guess is that these large sets of SYNCs are processing a lot of data, and that this is "tickling" some problem with the network infrastructure.

Now, after this failure, the slon should drop down to doing just 1 SYNC, and then start doubling the number of SYNCs again, presumably until it hits the network issue again. You might want to decrease the config parameter "sync_group_maxsize" (either on the slon command line or in the slon.conf file) so that you don't tempt the network problem.

As a CVS HEAD item, I'm going to see about cutting down on the number of NOTIFY requests, but I don't think that's relevant to your problem.

--
Configure bugmail: http://www.slony.info/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are the assignee for the bug.
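The sync_group_maxsize suggestion from the comment above can be illustrated with a slon.conf-style fragment; the value shown is purely an example, not a recommendation, so check your version's slon documentation for the valid range and default:

```
# slon.conf fragment -- illustrative only; 10 is an example value.
# Smaller SYNC groups move less data per transaction, which may
# avoid tripping the suspected network problem.
sync_group_maxsize = 10
```

The same setting can, if memory serves, also be given on the slon command line (via its group-size option) rather than in slon.conf.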
More information about the Slony1-bugs mailing list