Christopher Browne cbbrowne
Fri Mar 4 19:19:31 PST 2005
Dr. Christian Storm wrote:

>
> On Mar 4, 2005, at 10:58 AM, Darcy Buskermolen wrote:
>
>> On Friday 04 March 2005 09:38, Dr.Christian Storm wrote:
>>
>>> I walked in this morning and looked at the load on our master db and
>>> saw that between 4:30-6PM last night the average load went steadily up
>>> from 0.5 to 2.5.  Using top I see that the two FETCH queries from Slony
>>> are swallowing two of our four CPUs:
>>>
>>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 19029 postgres  16   0  671m 600m 668m R 73.0  7.9 788:58.38 postgres: postgres tii 192.168.1.42 FETCH
>>> 19206 postgres  16   0  671m 600m 668m R 72.7  7.9 788:46.65 postgres: postgres tii 192.168.1.43 FETCH
>>>
>>> The %CPU vacillates between 70% and 90%.
>>>
>>> A further inspection of the stat activity confirms this:
>>>
>>> tii=# select procpid, query_start, age(now(),query_start),
>>> current_query from pg_stat_activity where current_query not like
>>> '%IDLE%' order by query_start;
>>>   procpid |          query_start          |       age       |    current_query
>>> ----------+-------------------------------+-----------------+----------------------
>>>     19029 | 2005-03-04 09:29:44.71891-08  | 00:00:06.708432 | fetch 100 from LOG;
>>>     19206 | 2005-03-04 09:29:47.768228-08 | 00:00:03.659114 | fetch 100 from LOG;
>>>
>>> The age ranges anywhere from 0 to 9 seconds.
>>>
>>> The load on the two slaves hasn't changed.
>>
>>
>> What's the row count in _clustername.sl_log_1 and _clustername.sl_event?
>
>
> tii=# select count(*) from sl_log_1;
>  count
> --------
>  375344
> (1 row)
>
> tii=# select count(*) from sl_event;
>  count
> -------
>    746
> (1 row)

That is consistent with data not getting cleaned out; the cleanup thread only trims sl_log_1 once every node has confirmed the SYNCs that cover those rows.

Try this query:

    select con_origin, con_received, min(con_seqno) as minseq,
           max(con_seqno) as maxseq,
           date_trunc('minutes', min(now()-con_timestamp)) as age1,
           date_trunc('minutes', max(now()-con_timestamp)) as age2
    from sl_confirm
    group by con_origin, con_received
    order by con_origin, con_received;

On one of our systems, this generates the following:

Summary of sl_confirm aging
   Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of eldest SYNC
==================================================================
        1          5    7934376    7934861      00:00:00      00:13:00
        1         36    7934375    7934861      00:00:00      00:13:00
        1         78    7934384    7934870      00:00:00      00:13:00
        1        501    7934382    7934870      00:00:00      00:13:00
        5          1     131472     131478      00:01:00      00:13:00
        5         36     131472     131478      00:01:00      00:13:00
        5         78     131472     131478      00:01:00      00:13:00
        5        501     131472     131478      00:01:00      00:13:00
       36          1      15230      15237      00:00:00      00:14:00
       36          5      15230      15237      00:00:00      00:14:00
       36         78      15230      15237      00:00:00      00:14:00
       36        501      15230      15237      00:00:00      00:14:00
       78          1      75423      75430      00:00:00      00:14:00
       78          5      75423      75430      00:00:00      00:14:00
       78         36      75423      75430      00:00:00      00:14:00
       78        501      75423      75430      00:00:00      00:14:00
      501          1     117875     117876      00:04:00      00:14:00
      501          5     117875     117876      00:04:00      00:14:00
      501         36     117875     117876      00:04:00      00:14:00
      501         78     117875     117876      00:04:00      00:14:00



I expect you'll find that there's a node that is a day or two behind in 
confirming SYNCs.
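If you want to pin down which node that is, a sketch along these lines against sl_confirm should do it (substitute your master's node id for the con_origin = 1 below; that value is just an assumption about your setup):

    select con_received,
           max(con_seqno) as last_confirmed_seq,
           date_trunc('minutes', min(now() - con_timestamp)) as time_since_confirm
    from sl_confirm
    where con_origin = 1    -- assumes node 1 is the master/origin
    group by con_received
    order by time_since_confirm desc;

The receiver whose age comes back in hours or days rather than minutes is the one that has stopped confirming.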

There is a script in CVS called "test_slony_state.pl" that rummages around for various pieces of information such as the above; it is quite useful for detecting some misconfigurations.

I'd almost bet lunch that one of the subscriber nodes isn't confirming 
SYNCs against the others, and that this is the result of there being no 
suitable sl_listen path between the subscribers...
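To see what listen paths the cluster actually has, a quick look at sl_listen (it lives in the _clustername schema, so qualify it or set search_path accordingly) should tell the tale:

    select li_origin, li_provider, li_receiver
    from sl_listen
    order by li_origin, li_receiver;

Each subscriber needs to show up as a receiver for every other node's events; if a pair is missing there, adding it with a STORE LISTEN in slonik is the usual cure.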

