Wed Jul 29 14:09:57 PDT 2009
> Quoting Jason Culverhouse <jason at merchantcircle.com>:
> > I see these notices on node 20
> > NOTICE: Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
> > NOTICE: Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
> > NOTICE: Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
> >
> > I see FETCH statements taking up 100% cpu on node 30
> >
> > Any idea on what to do here?
>
> This happens when it's continuing to notice tuples lingering in
> sl_log_2.
>
> You should be able to check on node 20 and see that "select * from
> _myschema.sl_log_2 limit 1;" returns a tuple.
>
It returns a tuple.
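A quick way to gauge how much is actually lingering there (using the same
_myschema placeholder as in your example) is something like:

   -- rough size of the backlog that the truncate is waiting on
   select count(*) from _myschema.sl_log_2;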
On nodes 40 and 30 I get this from sl_log_status:
 sequence_name | last_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called
---------------+------------+--------------+-----------+-----------+-------------+---------+-----------+-----------
 sl_log_status |          3 |            1 |         3 |         0 |           1 |       0 | f         | t
On node 20:

 sequence_name | last_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called
---------------+------------+--------------+-----------+-----------+-------------+---------+-----------+-----------
 sl_log_status |          2 |            1 |         3 |         0 |           1 |       0 | f         | t
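For anyone reading this later: sl_log_status is just a sequence in the
cluster schema, so the same check on any node is simply:

   -- which log table is active, and whether a truncate is still pending
   select last_value from _myschema.sl_log_status;

My understanding is that 0/1 mean sl_log_1/sl_log_2 is active with nothing
pending, while 2/3 mean the switch has happened but the old log table has
not been truncated yet, which matches the NOTICEs above.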
Only node 20 falls behind node 40, and it stays a fairly consistent number
of log entries behind:
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |   st_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
        40 |          20 |      15469518 | 2009-07-29 14:03:51.453982 |         15463732 | 2009-07-29 14:03:09.683189 | 2009-07-29 12:05:51.162247 |              5786 | 01:58:00.426609
        40 |          30 |      15469518 | 2009-07-29 14:03:51.453982 |         15469495 | 2009-07-29 14:04:21.481599 | 2009-07-29 14:03:23.781414 |                23 | 00:00:27.807442
That lag (roughly two hours) has been pretty consistent all day.
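That lag data comes straight from the sl_status view, so it is easy to keep
an eye on with something like:

   -- per-subscriber replication lag, as seen from this node
   select st_origin, st_received, st_lag_num_events, st_lag_time
     from _myschema.sl_status
    order by st_received;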
> I'd suggest running test_slony_state{-dbi}.pl, to check the general
> condition of things. It'll doubtless gripe about there being a lot of
> tuples in sl_log_2.
Tests for node 40 - DSN = host=node40 dbname=my_database user=postgres
port=5432
========================================
pg_listener info:
Pages: 12
Tuples: 2
Size Tests
================================================
sl_log_1 137219 3208410.000000
sl_log_2 570 7030.000000
sl_seqlog 16 258.000000
Listen Path Analysis
===================================================
No problems found with sl_listen
--------------------------------------------------------------------------------
Summary of event info
Origin Min SYNC Max SYNC Min SYNC Age Max SYNC Age
================================================================================
30 28037660 28037665 00:00:00 00:10:00 0
40 15462272 15462752 01:56:00 02:05:00 1
20 8725074 8725144 00:00:00 00:11:00 0
---------------------------------------------------------------------------------
Summary of sl_confirm aging
Origin  Receiver  Min SYNC   Max SYNC   Age of latest SYNC  Age of eldest SYNC
================================================================================
20      30        8725074    8725139    00:00:00            00:11:00           0
20      40        8725080    8725139    00:00:00            00:10:00           0
30      20        28037660   28037665   00:00:00            00:10:00           0
30      40        28037660   28037665   00:00:00            00:10:00           0
40      20        15462272   15462752   00:00:00            00:11:00           0
40      30        15467769   15468362   00:00:00            00:10:00           0
------------------------------------------------------------------------------
Listing of old open connections
Database  PID  User  Query Age  Query
================================================================================
Tests for node 30 - DSN = host=node30 dbname=my_database user=postgres
port=5432
========================================
pg_listener info:
Pages: 12
Tuples: 2
Size Tests
================================================
sl_log_1 137219 3208410.000000
sl_log_2 570 7030.000000
sl_seqlog 16 258.000000
Listen Path Analysis
===================================================
No problems found with sl_listen
--------------------------------------------------------------------------------
Summary of event info
Origin Min SYNC Max SYNC Min SYNC Age Max SYNC Age
================================================================================
30 28037660 28037665 00:00:00 00:10:00 0
40 15462272 15462752 01:56:00 02:05:00 1
20 8725074 8725144 00:00:00 00:11:00 0
---------------------------------------------------------------------------------
Summary of sl_confirm aging
Origin  Receiver  Min SYNC   Max SYNC   Age of latest SYNC  Age of eldest SYNC
================================================================================
20      30        8725074    8725139    00:00:00            00:11:00           0
20      40        8725080    8725139    00:00:00            00:10:00           0
30      20        28037660   28037665   00:00:00            00:10:00           0
30      40        28037660   28037665   00:00:00            00:10:00           0
40      20        15462272   15462752   00:00:00            00:11:00           0
40      30        15467769   15468362   00:00:00            00:10:00           0
------------------------------------------------------------------------------
Listing of old open connections
Database  PID  User  Query Age  Query
================================================================================
Tests for node 20 - DSN = host=node20 dbname=my_database user=postgres
port=5432
========================================
pg_listener info:
Pages: 12
Tuples: 2
Size Tests
================================================
sl_log_1 137219 3208410.000000
sl_log_2 570 7030.000000
sl_seqlog 16 258.000000
Listen Path Analysis
===================================================
No problems found with sl_listen
--------------------------------------------------------------------------------
Summary of event info
Origin Min SYNC Max SYNC Min SYNC Age Max SYNC Age
================================================================================
30 28037660 28037665 00:00:00 00:10:00 0
40 15462272 15462752 01:56:00 02:05:00 1
20 8725074 8725144 00:00:00 00:11:00 0
---------------------------------------------------------------------------------
Summary of sl_confirm aging
Origin  Receiver  Min SYNC   Max SYNC   Age of latest SYNC  Age of eldest SYNC
================================================================================
20      30        8725074    8725139    00:00:00            00:11:00           0
20      40        8725080    8725139    00:00:00            00:10:00           0
30      20        28037660   28037665   00:00:00            00:10:00           0
30      40        28037660   28037665   00:00:00            00:10:00           0
40      20        15462272   15462752   00:00:00            00:11:00           0
40      30        15467769   15468362   00:00:00            00:10:00           0
------------------------------------------------------------------------------
Listing of old open connections
Database  PID  User  Query Age  Query
================================================================================
Sending message thus - | -s "Slony State Test Warning - Cluster my_cluster"
Message:
Node: 20 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.
You may wish to investigate whether or not a node is down, or perhaps
if sl_confirm entries have not been propagating properly.
Node: 30 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.
You may wish to investigate whether or not a node is down, or perhaps
if sl_confirm entries have not been propagating properly.
Node: 40 sl_log_1 tuples = 3.20841e+06 > 200000
================================================
Number of tuples in Slony-I table sl_log_1 is 3.20841e+06 which
exceeds 200000.
You may wish to investigate whether or not a node is down, or perhaps
if sl_confirm entries have not been propagating properly.
>
> I'd *suspect* that a confirmation isn't getting back to node 20, so it
> doesn't know that those tuples can be deleted, and hence never gets
> around to truncating the log table.
>
> The "failure of event communications" is the thing I'd expect to be
> most likely to be the *real* cause.
>
> I could be wrong, but it's pretty easy to check (see the other email
> thread today where I pointed someone at running the state test tool),
> and it is hopefully easy to rectify if that's the case.
>
> If communications are fine, then the question arises, why aren't those
> old tuples in sl_log_2 getting trimmed? That should help direct the
> questions, irrespective of cause.
>
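To look at the confirmation side directly, I believe a query along these
lines against sl_confirm shows the newest confirmation each receiver has
reported for each origin; a pair with an old timestamp would point at
confirmations not propagating back:

   -- newest confirmation recorded for each origin/receiver pair
   select con_origin, con_received,
          max(con_seqno)     as last_confirmed_event,
          max(con_timestamp) as last_confirmed_at
     from _myschema.sl_confirm
    group by con_origin, con_received
    order by con_origin, con_received;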
> FYI, if there's some misconfiguration that has caused this, and the
> truncate then comes thru (~10-15 minutes later), then you may want to
> induce another log switch pretty soon since sl_log_1 probably has a
> lot of data crudding it up. Again, someone asked how to induce this
> just recently, so take a peek at recent archives for how to do that.
>
>
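For the archives: as far as I know, a log switch can be kicked off by hand
by calling the stored function the slons use for it, i.e.

   -- ask Slony-I to begin another log switch
   select _myschema.logswitch_start();

on the node in question; the cleanup thread then finishes the switch on a
later pass.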