Tue Dec 27 14:18:03 PST 2011
Thanks for the reply. Per your questions, here is additional information. For the purposes of this discussion, db1 (node 101) is our Slave and db2 (node 102) is our Slony Master.

> On Tue, 27 Dec 2011, Mike Wilson wrote:
>
> 1. The good news is restarting postgresql is unlikely to fix your problem.

I figured as much, but I would still like to try it. I've hunted around to try to find out how to track down the PG session that is associated with an xid, but thus far no such luck.

> 2. Now that you have killed the old vacuums you need to determine what, if anything, slony is doing.

During the period where replication stopped I found several vacuums that had been running for upwards of 11 days. I killed those processes (using pg_cancel_backend) and shortly after that I started to get pings from Nagios (which monitors the sl_status table) that replication was falling behind.

> a) Are new SYNC events being generated on the master?

I'm not sure how I determine this.

> b) According to the master, is the master confirming SYNC events? Even if the slave is two weeks behind, is the highest confirmed event # increasing over time (ie over an hour)?

See next question. The slave's st_last_event value is increasing, whereas the master's is static.
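For a) and b), here is what I'm planning to run on the master, based on my reading of the Slony schema. This is only a sketch: "_MYCLUSTER" is a placeholder for whatever our cluster schema is really called, and the node ids are the ones above. If max(ev_seqno) in the first query keeps climbing, new SYNC events are being generated; if max(con_seqno) in the second climbs over the course of an hour, the master is seeing confirmations from the slave:

    -- On the master (node 102): most recent SYNC event generated locally
    select max(ev_seqno) as last_sync, max(ev_timestamp) as last_sync_ts
      from _MYCLUSTER.sl_event
     where ev_origin = 102
       and ev_type = 'SYNC';

    -- On the master: most recent event the slave (node 101) has confirmed back
    select max(con_seqno) as last_confirmed, max(con_timestamp) as last_confirmed_ts
      from _MYCLUSTER.sl_confirm
     where con_origin = 102
       and con_received = 101;

Please correct me if sl_status already answers this and I'm just misreading it.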
> c) Is the slave receiving new SYNC events FROM THE MASTER?

I'm not sure. How would I tell? See below for snippets of our current logs. It does appear that the two nodes are doing "something". Here is the query output of the slave over a period of a few minutes:

=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |   st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
       101 |         102 |    5000826162 | 2011-12-27 13:55:56.563939 |       5000824786 | 2011-12-27 10:06:32.519807 | 2011-12-27 10:06:31.719339 |              1376 | 03:49:28.222625

=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |   st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+----------------
       101 |         102 |    5000826238 | 2011-12-27 14:08:36.821359 |       5000824786 | 2011-12-27 10:06:32.519807 | 2011-12-27 10:06:31.719339 |              1452 | 04:02:05.30789

Here is the sl_status table from the master over the same few minutes:

=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
       102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:40:41.911154

=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
       102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:42:49.222796

As you can see, the slave's st_last_event value appears to be increasing, whereas the master's appears to be static.

> d) What does your slon log for the slave slon say it is doing? Is it processing SYNC events that take a very long time? Is it stuck? How long does it take to process a SYNC event? How many inserts, updates, deletes are in this?

Please see below for snippets of the master/slave slony.log.

> e) Are there any other ERROR messages in your slon log (use search, don't eyeball it). ERROR messages contain the word 'ERROR'.

There are no recent error messages in either the master's or the slave's Slony logs. There was an error in the master's slony.log regarding an xid, but when I restarted the Slony processes on our master it apparently overwrote the log, and I don't have a copy of the log from before the restart.

Here is a small bit of the current slony.log from the master (node 102). Other than the NOTICE regarding the log switch, the log is clean (I grep'd it for errors as well):

2011-12-27 09:23:47 PST INFO  cleanupThread:    4.996 seconds for cleanupEvent()
NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
2011-12-27 09:34:37 PST INFO  cleanupThread:    4.903 seconds for cleanupEvent()
NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
2011-12-27 09:45:13 PST INFO  cleanupThread:    2.795 seconds for cleanupEvent()
2011-12-27 09:45:13 PST INFO  cleanupThread:    0.005 seconds for vacuuming
NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment

Here is what I'm seeing in the slave's (node 101) slony.log; it also looks clean with no recent errors (I grep'd):

2011-12-27 14:05:11 PST INFO  remoteWorkerThread_102: syncing set 3 with 24 table(s) from provider 102
2011-12-27 14:05:11 PST INFO  remoteWorkerThread_102: syncing set 1 with 185 table(s) from provider 102
2011-12-27 14:05:11 PST INFO  remoteWorkerThread_102: syncing set 2 with 13 table(s) from provider 102
NOTICE:  Slony-I: log switch to sl_log_1 complete - truncate sl_log_2
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
2011-12-27 14:05:45 PST INFO  cleanupThread:    0.037 seconds for cleanupEvent()
2011-12-27 14:05:45 PST INFO  cleanupThread:    0.008 seconds for vacuuming
2011-12-27 14:06:32 PST INFO  remoteWorkerThread_102: SYNC 5003985339 done in 80.973 seconds
2011-12-27 14:06:32 PST INFO  remoteWorkerThread_102: syncing set 3 with 24 table(s) from provider 102
2011-12-27 14:06:32 PST INFO  remoteWorkerThread_102: syncing set 1 with 185 table(s) from provider 102
2011-12-27 14:06:32 PST INFO  remoteWorkerThread_102: syncing set 2 with 13 table(s) from provider 102

Note that I have three replication sets defined.
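On the "how many inserts, updates, deletes" part of d), this is what I intend to run on the master to size the backlog sitting in the log tables. Again only a sketch, with "_MYCLUSTER" standing in for our real cluster schema; I realise the plain count(*) could take a while against a backlog this size:

    -- Rows still queued in each log table on the master
    select 'sl_log_1' as log_table, count(*) as backlog_rows
      from _MYCLUSTER.sl_log_1
    union all
    select 'sl_log_2', count(*)
      from _MYCLUSTER.sl_log_2;

    -- Pending changes by command type (I assume I/U/D map to insert/update/delete)
    select log_cmdtype, count(*) as backlog_rows
      from _MYCLUSTER.sl_log_2
     group by log_cmdtype
     order by count(*) desc;

If there's a cheaper way to estimate this (pg_class.reltuples, say), I'm happy to use that instead.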
> If replication is working just very far behind it might be faster for you to create a new slave instead of waiting for the existing slave to catch up. This depends on how large your database is, how far behind it is and how fast your network is.

If at all possible I need to heal the current cluster rather than producing a new one. The slave has numerous other schemas, tables, indexes, etc., so it isn't a duplicate of the master. Because of this, if at all possible, I would like to try to get replication working again, even if it took weeks for the two systems to come back in sync.

Thanks so much for your help thus far. Please advise on any fixes I could attempt, or, if you can, describe what is currently going on. Cheers.

> Steve
>
>> Under incredible load last week during the Christmas season our primary PG (8.4.7; Slony 2.0.7) stopped replicating. Now that we are past the Christmas season I need to figure out how to clear the backlog of replication rows in sl_log_1/2. This is all running on our commercial website, and if possible I would prefer not to have to restart the PG instance, as it would require a scheduled maintenance window in a week when everyone is out of the office. In an attempt to fix the issue without rebooting the PG instance, I've already restarted the Slony services on the primary PG node as a first attempt at a fix. This did not get replication working again and I'm still getting the same error from Slony in the logs:
>>
>> log switch to sl_log_1 still in progress - sl_log_2 not truncated
>>
>> From my research I can see that this message is emitted when the function logswitch_finish() is called. I did have some hung vacuums during this period of high load on the server, but I have killed them with pg_cancel_backend. From other lock analysis I can see that nothing is currently running or locked in the db (nothing more than a few milliseconds old, at least). I'm certain whatever transaction was in progress that prevented the switch from occurring is long since past.
>>
>> Any ideas on the best way to get replication working again? I'm averse to rebooting the PG instance, but I am willing to do it if someone more knowledgeable out there thinks it would fix this issue. We currently are operating without a backup of all of our Xmas sales data and I *really* want to get this data replicated. Any help would be appreciated.
>>
>> Mike Wilson
>> Predicate Logic
>> Cell: (310) 600-8777
>> SkypeID: lycovian
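PS: on the xid question near the top of this mail - I haven't found a direct xid-to-backend mapping in 8.4, so the best I've come up with is to check for any old open transactions that could still be holding up logswitch_finish(). Treat this as a sketch; the column names are the 8.4 ones (procpid, current_query):

    -- Oldest open transactions; anything with a very old xact_start is suspect
    select procpid, usename, xact_start, waiting, current_query
      from pg_stat_activity
     where xact_start is not null
     order by xact_start
     limit 10;

    -- Oldest transaction id the system still considers in progress
    select txid_snapshot_xmin(txid_current_snapshot()) as oldest_running_xid;

If the oldest xact_start is recent and that xmin is well past the xid from the old error, then presumably whatever blocked the log switch is gone and a later cleanup pass should be able to finish it - but please correct me if that reasoning is off.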