Steve Singer steve at ssinger.info
Fri Sep 7 08:46:54 PDT 2012
On 09/06/2012 05:49 PM, Brian Fehrle wrote:
> More information:

> Looking at pg_stat_activity, it looks like 'fetch 100 from LOG' 
> has the same transaction start time as the query start time, which 
> suggests that each time it fetches 100 from the log it is starting a 
> new cursor. Is this normal / expected?
>

Yes, this is normal for your version of Slony, but it is a bug. It was 
fixed in Slony 2.1; I think the bug number was 167.
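
In practice that means the remote worker pays the cursor-positioning cost 
over and over instead of once. A rough sketch of the difference (from 
memory, not the literal 1.2 code; the sl_log_2 column list, the schema 
name '_mycluster', and the WHERE clause are illustrative):

  -- intended pattern: one transaction, one cursor, many cheap fetches
  BEGIN;
  DECLARE LOG CURSOR FOR
    SELECT log_origin, log_xid, log_tableid,
           log_actionseq, log_cmdtype, log_cmddata
      FROM _mycluster.sl_log_2
     WHERE log_origin = 1              -- illustrative predicate
     ORDER BY log_actionseq;
  FETCH 100 FROM LOG;   -- picks up where the cursor left off
  FETCH 100 FROM LOG;
  COMMIT;

  -- what matching xact_start/query_start suggests instead: a new
  -- transaction, and therefore a new cursor, per fetch, so the scan
  -- and sort behind the cursor are repeated for every batch of 100.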

>
> select now() - xact_start as transaction_age, now() - query_start as 
> query_age, current_query::varchar(80), waiting from pg_stat_activity 
> where usename = 'slony' and current_query like '%LOG%';
>  transaction_age |    query_age    |    current_query     | waiting
> -----------------+-----------------+----------------------+---------
>  00:27:01.652087 | 00:27:01.650933 | fetch 100 from LOG;  | f
> (1 row)
>
>
> So this has to be my issue. The query that hits sl_log_2 is a 
> sequential scan. Does this mean that every time we retrieve 100 rows 
> from the cursor, we do another sequential scan on the table to get the 
> next 100 rows?
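
Effectively yes, if the cursor is re-opened for every batch. You can see 
what plan the log selection gets with a plain EXPLAIN. Something like the 
following (the column list, schema name '_mycluster', and LIMIT are 
illustrative -- lift the real statement from the master's logs if you can):

  EXPLAIN
  SELECT log_origin, log_xid, log_tableid,
         log_actionseq, log_cmdtype, log_cmddata
    FROM _mycluster.sl_log_2
   ORDER BY log_actionseq
   LIMIT 100;

If that shows a Seq Scan on sl_log_2 under a sort, each re-open pays for 
scanning the whole ~9M-row table again.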
>
> - Brian F
>
>
> On 09/06/2012 03:09 PM, Brian Fehrle wrote:
>> Hi all,
>>
>> postgresql 8.4
>> slony 1.2
>> centos system
>>
>> I have a single-master, single-slave Slony cluster where Slony is very
>> far behind. There are no errors in the logs, but what looks
>> to be happening is that the queries Slony executes against the Slony log
>> table on the master are taking a long time to complete.
>>
>> The query is "fetch 100 from LOG;" which can take a long time to
>> complete, over 15 minutes at times. Each time this happens we process 1
>> event. This usually takes milliseconds to complete.
>>
>> At this point, the query on the master 'fetch 100 from log' takes about
>> 20 minutes to complete, and after it completes the slave processes 1
>> more event, and then 'fetch 100 from log' kicks off again and takes yet
>> another 20 or so minutes. So the slave is processing an event about once
>> per 20 minutes.
>>
>> As for a cause, I believe it's the follow-up work after adding a
>> column to a table in replication. After adding the column, I believe
>> the table was updated to set values in the newly added column. That
>> could generate millions of new rows for Slony to process, which may
>> have made the log tables so large that queries against them now fall
>> back to sequential scans.
>>
>> I'm trying to dig in and see exactly what 'fetch 100 from LOG' is doing
>> on the master, and whether I can speed it up. Is this querying the
>> sl_log_<1/2> tables?
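
It is -- the LOG cursor reads the sl_log_1/sl_log_2 tables on the origin. 
To see how big those tables are on disk, plain catalog functions work fine 
on 8.4:

  SELECT n.nspname, c.relname,
         pg_size_pretty(pg_relation_size(c.oid)) AS size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
   WHERE c.relname IN ('sl_log_1', 'sl_log_2');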
>>
>> The pg_notify table does not have outstanding dead rows; it's at 0.
>> Also, out of all the Slony tables in the Slony schema, the one with
>> the most dead rows has only about 2K of them.
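
For completeness, those dead-row counts can be pulled from the standard 
statistics view (present in 8.4); replace '_mycluster' with your actual 
Slony schema name:

  SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
   WHERE schemaname = '_mycluster'
   ORDER BY n_dead_tup DESC;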
>>
>> sl_log_1 has 0 rows, sl_log_2 has about 9,326,260 rows (and zero dead rows).
>>
>> I'm going to see if we can reduce the sync group size, to see whether
>> that shrinks the query result set enough to get an index scan instead
>> of a sequential scan (if that's even the issue).
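
The sync group size is set on the subscriber's slon; from memory (worth 
double-checking against the slon docs for your version) it's the -g 
switch or sync_group_maxsize in the slon config file:

  # slon.conf
  sync_group_maxsize = 6

  # or equivalently on the command line; cluster name and conninfo
  # here are placeholders
  slon -g 6 mycluster 'dbname=replica host=slavehost user=slony'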
>>
>> Any help is greatly appreciated.
>>
>> - Brian F
