Thu May 29 12:26:31 PDT 2008
- Previous message: [Slony1-general] Slow replication issue
- Next message: [Slony1-general] auto vac output: Page Slots
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 5/27/2008 10:26 AM, Vivek Khera wrote:
> On May 23, 2008, at 2:05 PM, Jan Wieck wrote:
>
>> The slony connections are all regular libpq database connections. So
>> you might test this by using psql or pg_dump running on one of those
>> subscribers, connecting to the appropriate data provider. If that
>> can utilize more bandwidth, then the problem lies within the replica
>> itself and something else must be limiting it from reading from the
>> network faster.
>
> Every time I dig into why our replication is lagging severely (more
> than 10-15 minutes) I find that I'm spending a lot of time inside
> "FETCH 100" queries, and then a lot of time between them as well (i.e.,
> applying the updates to the replica). It feels as if pg isn't running
> at full speed, but I don't have the numbers to prove it.

Which is surprising, because the entire (complex, I might add) architecture of helper threads doing the fetch, placing result rows into buffers and shoveling them towards the remote_worker thread was supposed to reduce exactly those times, where the remote_worker is "waiting" for rows instead of applying changes full-bore.

One thing that comes to mind is whether the slon is actually running on the same box as any of the involved databases. In that case I can think of a scenario where the remote helper is buffering quite well ahead ... and since the local DB is heavily under fire, the OS thinks it's a good idea to page those buffers out.

To make better-educated guesses we probably need a few more DEBUG points where the remote worker and helpers issue messages when the internal buffer exceeds some high water mark, or when (after exceeding high) it falls back below some low water mark. That would help us very much to fine-tune the actual amount of buffers and the fetch size (both of which should be config options).

Another thing to look for is the dreaded "delay for first row".
That is the time that passes on the data provider between the subscriber asking for the current SYNC's log rows and the data provider actually returning the first FETCH chunk. During that time the subscriber is definitely doing nothing but twiddling its thumbs. Unfortunately, curing that problem is going to add a lot more complexity to the worker/helper architecture.

But I would like to hear what typical ratios of "delay for first row" / "time for entire sync" are out in the field. If that delay accounts for a large portion of the entire sync processing, we had better look into improving that part, however complex the solution might be.

Again, it would help to have some better logging here that simply states the percentage of time spent waiting during the sync processing. That would be the sum of all delays incurred while the worker is waiting for log rows.

Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
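
As a rough illustration of the high/low water mark DEBUG points suggested above, here is a minimal Python sketch. It is not slon code: the class name `WatermarkMonitor`, its `low`/`high` parameters and the `events` list are all hypothetical, and a real implementation would live in the C worker/helper threads. The point is only the hysteresis: one message when the buffer level first exceeds the high mark, and one more when it later falls back below the low mark, rather than a message per row.

```python
import logging

log = logging.getLogger("buffer_watermark")  # hypothetical logger name


class WatermarkMonitor:
    """Report buffer-level crossings with hysteresis (illustrative only)."""

    def __init__(self, low, high):
        if not 0 <= low < high:
            raise ValueError("need 0 <= low < high")
        self.low = low
        self.high = high
        self.above_high = False  # True between a "high" and the next "low" event
        self.events = []         # (kind, level) tuples, kept for inspection

    def update(self, level):
        """Call whenever the number of buffered rows changes."""
        if not self.above_high and level > self.high:
            # First time above the high mark since the last "low" event.
            self.above_high = True
            self.events.append(("high", level))
            log.debug("buffer exceeded high water mark: %d > %d", level, self.high)
        elif self.above_high and level < self.low:
            # Fell back below the low mark after having exceeded the high mark.
            self.above_high = False
            self.events.append(("low", level))
            log.debug("buffer fell below low water mark: %d < %d", level, self.low)
```

Feeding it the levels 1, 3, 6, 7, 4, 6, 1, 2 with `low=2, high=5` records exactly two events, one at level 6 and one at level 1. Everything in between stays silent, which keeps the log readable while the buffer and fetch sizes are being tuned.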
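
The two numbers asked for above, the "delay for first row" / "time for entire sync" ratio and the percentage of a sync spent waiting, come down to simple arithmetic once the timestamps are captured. The Python sketch below assumes a hypothetical `SyncTiming` record with made-up field names; slon does not expose such a structure, and the timestamps would have to come from the extra logging discussed in this message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyncTiming:
    """Timestamps (in seconds) for one SYNC; all names are illustrative."""
    sync_start: float      # subscriber asks for the SYNC's log rows
    first_row_at: float    # first FETCH chunk arrives from the provider
    sync_end: float        # SYNC fully applied on the subscriber
    later_waits: List[float] = field(default_factory=list)  # later stalls waiting for rows

    def total(self) -> float:
        return self.sync_end - self.sync_start

    def first_row_delay(self) -> float:
        return self.first_row_at - self.sync_start

    def first_row_ratio(self) -> float:
        """'delay for first row' / 'time for entire sync'."""
        total = self.total()
        return self.first_row_delay() / total if total > 0 else 0.0

    def wait_percent(self) -> float:
        """Percentage of the sync spent waiting for log rows (sum of all delays)."""
        total = self.total()
        if total <= 0:
            return 0.0
        waiting = self.first_row_delay() + sum(self.later_waits)
        return 100.0 * waiting / total
```

For example, a sync that takes 10 seconds overall, waits 2 seconds for the first row, and later stalls for 0.5 and 1.5 seconds has a first-row ratio of 0.2 and spends 40 percent of its time waiting.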