Wed Jul 29 14:09:57 PDT 2009
- Previous message: [Slony1-general] Release candidates for 2.0.3, 1.2.17
- Next message: [Slony1-general] Log Switch not happening version 1.2.15
> Quoting Jason Culverhouse <jason at merchantcircle.com>:
> > I see these notices on node 20:
> > NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
> > NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
> > NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
> >
> > I see FETCH statements taking up 100% CPU on node 30.
> >
> > Any idea what to do here?
>
> This happens when it's continuing to notice tuples lingering in
> sl_log_2.
>
> You should be able to check on node 20 and see that "select * from
> _myschema.sl_log_2 limit 1;" returns a tuple.

It returns a tuple on nodes 40 and 30.

This is what I get from sl_log_status:

 sequence_name | last_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called
---------------+------------+--------------+-----------+-----------+-------------+---------+-----------+-----------
 sl_log_status |          3 |            1 |         3 |         0 |           1 |       0 | f         | t

On node 20:

 sequence_name | last_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called
---------------+------------+--------------+-----------+-----------+-------------+---------+-----------+-----------
 sl_log_status |          2 |            1 |         3 |         0 |           1 |       0 | f         | t

Node 20 only falls behind node 40, and it stays a very similar number of log entries behind:

 st_origin | st_received | st_last_event | st_last_event_ts           | st_last_received | st_last_received_ts        | st_last_received_event_ts  | st_lag_num_events | st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
        40 |          20 |      15469518 | 2009-07-29 14:03:51.453982 |         15463732 | 2009-07-29 14:03:09.683189 | 2009-07-29 12:05:51.162247 |              5786 | 01:58:00.426609
        40 |          30 |      15469518 | 2009-07-29 14:03:51.453982 |         15469495 | 2009-07-29 14:04:21.481599 | 2009-07-29 14:03:23.781414 |                23 | 00:00:27.807442

This lag has been pretty consistent all day.

> I'd suggest running test_slony_state{-dbi}.pl, to check the general
> condition of things.  It'll doubtless gripe about there being a lot of
> tuples in sl_log_2.
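For anyone reading this in the archives, the checks above boil down to a few queries per node. This is only a sketch; "_myschema" (as in Chris's example) stands in for the real cluster schema name:

    -- Which log table is active?  As I understand the sl_log_status sequence,
    -- 0 or 2 mean sl_log_1 is active, 1 or 3 mean sl_log_2 is active, and
    -- values 2 and 3 mean the other table is still waiting to be truncated.
    SELECT last_value FROM _myschema.sl_log_status;

    -- Is anything still lingering in the table that should be emptied out?
    SELECT count(*) FROM _myschema.sl_log_2;

    -- Per-receiver lag, as in the sl_status output above.
    SELECT st_origin, st_received, st_lag_num_events, st_lag_time
      FROM _myschema.sl_status;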
Tests for node 40 - DSN = host=node40 dbname=my_database user=postgres port=5432
========================================
pg_listener info:
     Pages: 12    Tuples: 2

Size Tests
================================================
     sl_log_1       137219     3208410.000000
     sl_log_2          570        7030.000000
     sl_seqlog          16         258.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
   Origin    Min SYNC     Max SYNC    Min SYNC Age   Max SYNC Age
   ========================================================================
       30    28037660     28037665    00:00:00       00:10:00       0
       40    15462272     15462752    01:56:00       02:05:00       1
       20     8725074      8725144    00:00:00       00:11:00       0
---------------------------------------------------------------------------------
Summary of sl_confirm aging
   Origin   Receiver    Min SYNC     Max SYNC    Age of latest SYNC   Age of eldest SYNC
   ========================================================================
       20         30     8725074      8725139    00:00:00             00:11:00            0
       20         40     8725080      8725139    00:00:00             00:10:00            0
       30         20    28037660     28037665    00:00:00             00:10:00            0
       30         40    28037660     28037665    00:00:00             00:10:00            0
       40         20    15462272     15462752    00:00:00             00:11:00            0
       40         30    15467769     15468362    00:00:00             00:10:00            0
------------------------------------------------------------------------------
Listing of old open connections
   Database   PID   User   Query Age   Query
   ========================================================================

Tests for node 30 - DSN = host=node30 dbname=my_database user=postgres port=5432
========================================
pg_listener info:
     Pages: 12    Tuples: 2

Size Tests
================================================
     sl_log_1       137219     3208410.000000
     sl_log_2          570        7030.000000
     sl_seqlog          16         258.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
   Origin    Min SYNC     Max SYNC    Min SYNC Age   Max SYNC Age
   ========================================================================
       30    28037660     28037665    00:00:00       00:10:00       0
       40    15462272     15462752    01:56:00       02:05:00       1
       20     8725074      8725144    00:00:00       00:11:00       0
---------------------------------------------------------------------------------
Summary of sl_confirm aging
   Origin   Receiver    Min SYNC     Max SYNC    Age of latest SYNC   Age of eldest SYNC
   ========================================================================
       20         30     8725074      8725139    00:00:00             00:11:00            0
       20         40     8725080      8725139    00:00:00             00:10:00            0
       30         20    28037660     28037665    00:00:00             00:10:00            0
       30         40    28037660     28037665    00:00:00             00:10:00            0
       40         20    15462272     15462752    00:00:00             00:11:00            0
       40         30    15467769     15468362    00:00:00             00:10:00            0
------------------------------------------------------------------------------
Listing of old open connections
   Database   PID   User   Query Age   Query
   ========================================================================

Tests for node 20 - DSN = host=node20 dbname=my_database user=postgres port=5432
========================================
pg_listener info:
     Pages: 12    Tuples: 2

Size Tests
================================================
     sl_log_1       137219     3208410.000000
     sl_log_2          570        7030.000000
     sl_seqlog          16         258.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
   Origin    Min SYNC     Max SYNC    Min SYNC Age   Max SYNC Age
   ========================================================================
       30    28037660     28037665    00:00:00       00:10:00       0
       40    15462272     15462752    01:56:00       02:05:00       1
       20     8725074      8725144    00:00:00       00:11:00       0
---------------------------------------------------------------------------------
Summary of sl_confirm aging
   Origin   Receiver    Min SYNC     Max SYNC    Age of latest SYNC   Age of eldest SYNC
   ========================================================================
       20         30     8725074      8725139    00:00:00             00:11:00            0
       20         40     8725080      8725139    00:00:00             00:10:00            0
       30         20    28037660     28037665    00:00:00             00:10:00            0
       30         40    28037660     28037665    00:00:00             00:10:00            0
       40         20    15462272     15462752    00:00:00             00:11:00            0
       40         30    15467769     15468362    00:00:00             00:10:00            0
------------------------------------------------------------------------------
Listing of old open connections
   Database   PID   User   Query Age   Query
   ========================================================================

Sending message thus - | -s "Slony State Test Warning - Cluster my_cluster"
Message:
Node: 20 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.  You may wish to investigate whether or not a node is
down, or perhaps if sl_confirm entries have not been propagating
properly.

Node: 30 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.  You may wish to investigate whether or not a node is
down, or perhaps if sl_confirm entries have not been propagating
properly.

Node: 40 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.  You may wish to investigate whether or not a node is
down, or perhaps if sl_confirm entries have not been propagating
properly.

> I'd *suspect* that a confirmation isn't getting back to node 20, so it
> doesn't know that those tuples can be deleted, and hence never gets
> around to truncating the log table.
>
> The "failure of event communications" is the thing I'd expect to be
> most likely to be the *real* cause.
>
> I could be wrong, but it's pretty easy to check (see the other email
> thread today where I pointed someone at running the state test tool),
> and it is hopefully easy to rectify if that's the case.
>
> If communications are fine, then the question arises, why aren't those
> old tuples in sl_log_2 getting trimmed?  That should help direct the
> questions, irrespective of cause.
>
> FYI, if there's some misconfiguration that has caused this, and the
> truncate then comes thru (~10-15 minutes later), then you may want to
> induce another log switch pretty soon since sl_log_1 probably has a
> lot of data crudding it up.  Again, someone asked how to induce this
> just recently, so take a peek at recent archives for how to do that.
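For the record, the confirmation propagation Chris is talking about can also be eyeballed directly in sl_confirm. A rough query run on node 20 (again with _myschema standing in for the real cluster schema) would be something like:

    -- Newest confirmed event, and its age, for each origin/receiver pair.
    -- If confirmations from the other nodes stop arriving here, this node
    -- can never decide that its old sl_log entries are safe to trim.
    SELECT con_origin, con_received,
           max(con_seqno)              AS last_confirmed_event,
           now() - max(con_timestamp)  AS confirmation_age
      FROM _myschema.sl_confirm
     GROUP BY con_origin, con_received
     ORDER BY con_origin, con_received;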
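And since the question of how to induce a log switch keeps coming up: if I remember right, 1.2 exposes this as a stored function in the cluster schema, so once the confirmations are flowing again something along these lines (run via psql, with the real schema name) should kick one off:

    -- Request a switch to the other log table; the old table is only
    -- truncated once every node has confirmed past the events it holds.
    SELECT _myschema.logswitch_start();

Double-check the recent archive thread Chris mentions before relying on this, though.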