Sun Mar 27 06:03:41 PDT 2011
- Previous message: [Slony1-general] Slony replication problem - logswitch failure
- Next message: [Slony1-general] How slony can solve a network problem
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 3/26/2011 8:02 PM, Tim Lloyd wrote:
> It wasn't mission critical changes lost. Postgres log was full of messages saying it couldn't switch the log because it was already in progress. Uptime was showing load averages of 60. Checking sl_log_1, it only had 4 entries. Nuking it and re-initing the log switch reduced the load average to between 4 and 12.

The Slony cleanup thread up to 1.2 does delete the no-longer-needed sl_log_X entries. So the fact that you found something in there means that you deleted stuff that probably had not replicated to all nodes yet. Maybe you personally don't care about a few lost updates, but for most of us what you suggested doing is, by itself, a good reason to start rebuilding all replicas.

The actual reason why a large backlog in sl_log_X causes problems is that the query plan for selecting that log scans the log from the beginning, however far into the log the catch-up has already progressed. So the startup cost of the log selection keeps increasing until processing of that entire sl_log_X actually finishes. All that time, the log switch cannot and should not finish.

We have a fix for that in the current 2.1 development tree and are considering backpatching that logic into 2.0.

Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
  -- Benjamin Franklin
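For the curious, the effect Jan describes can be sketched with a couple of ad-hoc queries. This is only an illustration, not the query the Slony remote worker actually issues; the cluster schema name _mycluster is a placeholder, and column names may differ between Slony-I versions:

    -- How much backlog is sitting in each log table?
    SELECT count(*) FROM _mycluster.sl_log_1;
    SELECT count(*) FROM _mycluster.sl_log_2;

    -- Rough caricature of the log selection: rows are fetched in
    -- action-sequence order, but the plan still reads the log from the
    -- start, so the startup cost grows the further catch-up has already
    -- progressed through a large backlog.
    SELECT log_actionseq, log_cmddata
      FROM _mycluster.sl_log_1
     WHERE log_origin = 1                 -- origin node id (placeholder)
       AND log_actionseq > 123456         -- already-applied sequence (placeholder)
     ORDER BY log_actionseq;

Rows still sitting in sl_log_1 are, as Jan points out, rows some node has not yet confirmed, which is why deleting them by hand trades a lower load average for replicas that are silently missing data.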