[Slony1-general] Log Switch not happening version 1.2.15

Wed Jul 29 14:09:57 PDT 2009

> Quoting Jason Culverhouse <jason at merchantcircle.com>:
> > I see these notices on node 20
> > NOTICE:  Slony-I: log switch to sl_log_1 still in progress -
> > sl_log_2 not truncated
> > NOTICE:  Slony-I: log switch to sl_log_1 still in progress -
> > sl_log_2 not truncated
> > NOTICE:  Slony-I: log switch to sl_log_1 still in progress -
> > sl_log_2 not truncated
> >
> > I see FETCH statements taking up 100% cpu on node 30
> >
> > Any Idea on what to do here?
>
> This happens when it's continuing to notice tuples lingering in  
> sl_log_2.
>
> You should be able to check on node 20 and see that "select * from
> _myschema.sl_log_2 limit 1;" returns a tuple.
>

Returns a tuple

on Node 40 and  30 I get this from sl_log_status
  sequence_name | last_value | increment_by | max_value | min_value |  
cache_value | log_cnt | is_cycled | is_called
---------------+------------+--------------+-----------+----------- 
+-------------+---------+-----------+-----------
  sl_log_status |          3 |            1 |         3 |         0  
|           1 |       0 | f         | t

On node 20
sequence_name | last_value | increment_by | max_value | min_value |  
cache_value | log_cnt | is_cycled | is_called
---------------+------------+--------------+-----------+----------- 
+-------------+---------+-----------+-----------
  sl_log_status |          2 |            1 |         3 |         0  
|           1 |       0 | f         | t

Node 20 only falls behind node 40 and is always a very similar number  
of log entries behind
  st_origin | st_received | st_last_event |      st_last_event_ts       
| st_last_received |    st_last_received_ts     |  
st_last_received_event_ts  | st_lag_num_events |   st_lag_time
-----------+-------------+---------------+---------------------------- 
+------------------+---------------------------- 
+----------------------------+-------------------+-----------------
         40 |          20 |      15469518 | 2009-07-29 14:03:51.453982  
|         15463732 | 2009-07-29 14:03:09.683189 | 2009-07-29  
12:05:51.162247 |              5786 | 01:58:00.426609
         40 |          30 |      15469518 | 2009-07-29 14:03:51.453982  
|         15469495 | 2009-07-29 14:04:21.481599 | 2009-07-29  
14:03:23.781414 |                23 | 00:00:27.807442

This time has been pretty consistent all day

> I'd suggest running test_slony_state{-dbi}.pl, to check the general
> condition of things.  It'll doubtless gripe about there being a lot of
> tuples in sl_log_2.

Tests for node 40 - DSN = host=node40 dbname=my_database user=postgres  
port=5432
========================================
pg_listener info:
Pages: 12
Tuples: 2

Size Tests
================================================
        sl_log_1    137219 3208410.000000
        sl_log_2       570 7030.000000
       sl_seqlog        16 258.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
  Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================
      30  28037660  28037665     00:00:00     00:10:00    0
      40  15462272  15462752     01:56:00     02:05:00    1
      20   8725074   8725144     00:00:00     00:11:00    0

---------------------------------------------------------------------------------
Summary of sl_confirm aging
    Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age  
of eldest SYNC
= 
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================
        20         30    8725074    8725139      00:00:00       
00:11:00    0
        20         40    8725080    8725139      00:00:00       
00:10:00    0
        30         20   28037660   28037665      00:00:00       
00:10:00    0
        30         40   28037660   28037665      00:00:00       
00:10:00    0
        40         20   15462272   15462752      00:00:00       
00:11:00    0
        40         30   15467769   15468362      00:00:00       
00:10:00    0

------------------------------------------------------------------------------

Listing of old open connections
        Database             PID            User    Query  
Age                Query
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================

Tests for node 30 - DSN = host=node30 dbname=my_database user=postgres  
port=5432
========================================
pg_listener info:
Pages: 12
Tuples: 2

Size Tests
================================================
        sl_log_1    137219 3208410.000000
        sl_log_2       570 7030.000000
       sl_seqlog        16 258.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
  Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================
      30  28037660  28037665     00:00:00     00:10:00    0
      40  15462272  15462752     01:56:00     02:05:00    1
      20   8725074   8725144     00:00:00     00:11:00    0

---------------------------------------------------------------------------------
Summary of sl_confirm aging
    Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age  
of eldest SYNC
= 
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================
        20         30    8725074    8725139      00:00:00       
00:11:00    0
        20         40    8725080    8725139      00:00:00       
00:10:00    0
        30         20   28037660   28037665      00:00:00       
00:10:00    0
        30         40   28037660   28037665      00:00:00       
00:10:00    0
        40         20   15462272   15462752      00:00:00       
00:11:00    0
        40         30   15467769   15468362      00:00:00       
00:10:00    0

------------------------------------------------------------------------------

Listing of old open connections
        Database             PID            User    Query  
Age                Query
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================

Tests for node 20 - DSN = host=node20 dbname=my_database user=postgres  
port=5432
========================================
pg_listener info:
Pages: 12
Tuples: 2

Size Tests
================================================
        sl_log_1    137219 3208410.000000
        sl_log_2       570 7030.000000
       sl_seqlog        16 258.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
  Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================
      30  28037660  28037665     00:00:00     00:10:00    0
      40  15462272  15462752     01:56:00     02:05:00    1
      20   8725074   8725144     00:00:00     00:11:00    0

---------------------------------------------------------------------------------
Summary of sl_confirm aging
    Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age  
of eldest SYNC
= 
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================
        20         30    8725074    8725139      00:00:00       
00:11:00    0
        20         40    8725080    8725139      00:00:00       
00:10:00    0
        30         20   28037660   28037665      00:00:00       
00:10:00    0
        30         40   28037660   28037665      00:00:00       
00:10:00    0
        40         20   15462272   15462752      00:00:00       
00:11:00    0
        40         30   15467769   15468362      00:00:00       
00:10:00    0

------------------------------------------------------------------------------

Listing of old open connections
        Database             PID            User    Query  
Age                Query
= 
= 
= 
= 
= 
= 
= 
= 
========================================================================

Sending message thus - | -s "Slony State Test Warning - Cluster  
my_cluster"
Message:

Node: 20 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.

You may wish to investigate whether or not a node is down, or perhaps
if sl_confirm entries have not been propagating properly.

Node: 30 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.

You may wish to investigate whether or not a node is down, or perhaps
if sl_confirm entries have not been propagating properly.

Node: 40 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.

You may wish to investigate whether or not a node is down, or perhaps
if sl_confirm entries have not been propagating properly.

>
> I'd *suspect* that a confirmation isn't getting back to node 20, so it
> doesn't know that those tuples can be deleted, and hence never gets
> around to truncating the log table.
>
> The "failure of event communications" is the thing I'd expect to be
> most likely to be the *real* cause.
>
> I could be wrong, but it's pretty easy to check (see the other email
> thread today where I pointed someone at running the state test tool),
> and it is hopefully easy to rectify if that's the case.
>
> If communications are fine, then the question arises, why aren't those
> old tuples in sl_log_2 getting trimmed?  That should help direct the
> questions, irrespective of cause.
>
> FYI, if there's some misconfiguration that has caused this, and the
> truncate then comes thru (~10-15 minutes later), then you may want to
> induce another log switch pretty soon since sl_log_1 probably has a
> lot of data crudding it up.  Again, someone asked how to induce this
> just recently, so take a peek at recent archives for how to do that.
>
>