Steve Singer ssinger_pg at sympatico.ca
Tue Dec 27 09:54:32 PST 2011
On Tue, 27 Dec 2011, Mike Wilson wrote:

1. The good news is that restarting PostgreSQL is unlikely to fix your problem.

2. Now that you have killed the old vacuums, you need to determine what, if 
anything, Slony is doing.

a) Are new SYNC events being generated on the master?

b) According to the master, is the slave confirming SYNC events?  Even if 
the slave is two weeks behind, is the highest confirmed event # increasing 
over time (i.e. over an hour)?

c) Is the slave receiving new SYNC events FROM THE MASTER?

d) What does your slon log for the slave slon say it is doing?  Is it 
processing SYNC events that take a very long time? Is it stuck? How long 
does it take to process a SYNC event? How many inserts, updates, and deletes 
are in each SYNC?

e) Are there any other ERROR messages in your slon log? (Use search, don't 
eyeball it.)  ERROR messages contain the word 'ERROR'.
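
For (e), here is a minimal sketch of searching the log instead of eyeballing 
it. It assumes nothing about the slon log format beyond what is said above, 
that ERROR messages contain the word 'ERROR'. The function names and the log 
path are placeholders of mine, not anything Slony provides.

```python
from pathlib import Path


def find_error_lines(log_text, needle="ERROR"):
    """Return (line_number, line) pairs for every line containing needle."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if needle in line
    ]


def scan_log_file(path, needle="ERROR"):
    """Read a log file (path is whatever your slon log is called) and scan it."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    text = Path(path).read_text(errors="replace")
    return find_error_lines(text, needle)
```

Calling scan_log_file() on the slave's slon log and printing the pairs gives 
you the line numbers to jump to. A plain grep for ERROR does the same job.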

If replication is working but is just very far behind, it might be faster 
for you to create a new slave instead of waiting for the existing slave to 
catch up.  This depends on how large your database is, how far behind it is, 
and how fast your network is.
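
A rough way to put numbers on that comparison, as a sketch only. It takes two 
samples of the highest confirmed event # and of the newest event # on the 
master, taken some hours apart as in (b), and assumes both rates stay 
constant, which they will not exactly. It also treats every event as equal 
work, and the copy estimate ignores index builds and everything else besides 
moving the bytes. All names and units here are my own choices.

```python
def hours_to_catch_up(confirmed_then, confirmed_now,
                      latest_then, latest_now, hours_between):
    """Estimate hours until the slave catches up; None if it is not gaining.

    confirmed_*: highest event # confirmed by the slave at each sample.
    latest_*:    newest event # generated on the master at each sample.
    """
    confirm_rate = (confirmed_now - confirmed_then) / hours_between
    generate_rate = (latest_now - latest_then) / hours_between
    backlog = latest_now - confirmed_now
    if backlog <= 0:
        return 0.0  # already caught up
    gain = confirm_rate - generate_rate
    if gain <= 0:
        return None  # falling further behind, or standing still
    return backlog / gain


def hours_to_copy(database_gb, network_mb_per_s):
    """Very rough time to ship database_gb over the network, in hours."""
    return database_gb * 1024 / network_mb_per_s / 3600
```

If hours_to_catch_up() returns None, or a figure much larger than 
hours_to_copy() for your database, that points towards building a new slave.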

Steve

> Under incredible load last week during the Christmas season our primary PG 
> (8.4.7: Slony 2.0.7) stopped replicating.  Now that we are past the 
> Christmas season I need to figure out how to clear the backlog of 
> replication rows in sl_log_1/2.  This is all running on our commercial 
> website and, if possible, I would prefer not to have to restart the PG 
> instance as it would require a scheduled maintenance window in a week 
> when everyone is out of the office.  In an attempt to fix the issue 
> without rebooting the PG instance, I've already restarted the Slony 
> services on the primary PG node.  This did not get replication working 
> again and I'm still getting the same error from Slony in the logs: 
> log switch to sl_log_1 still in progress - sl_log_2 not truncated
>
> From my research I can see that this error message is emitted when the 
> function logswitch_finish() is called.  I did have some hung vacuums 
> during this period of high load on the server but I have killed them with 
> pg_cancel_backend.  From other lock analysis I can see that nothing is 
> currently running or locked in the db (nothing more than a few 
> milliseconds old at least).  I'm certain whatever transaction was in 
> progress that prevented the switch from occurring is long since past.
>
> Any ideas on the best way to get replication working again?  I'm averse to 
> rebooting the PG instance but I am willing to do it if someone more 
> knowledgeable out there thinks it would fix this issue.  We are currently 
> operating without a backup of all of our Xmas sales data and I *really* 
> want to get this data replicated.  Any help would be appreciated.
>
> Mike Wilson
> Predicate Logic
> Cell: (310) 600-8777
> SkypeID: lycovian