Cyril SCETBON scetbon at echo.fr
Fri Jul 11 11:50:58 PDT 2008
Christopher Browne wrote:
> Cyril SCETBON <scetbon at echo.fr> writes:
>   
>>> It does not seem to hurt performance, as you said, but do you see
>>> anything else?
>>>       
>> If I check the different grouping size I see :
>>
>> count    size
>>
>>     1     255
>>     1     763
>>     2       7
>>   290       3
>>
>> If I understand correctly, you both mean that 301.017 seconds is the
>> duration of the processing of the group of events from node 1 whose
>> size is 255, and not of the 290 groups of 3 events?
>>     
>
> Right, 301.017s is the duration of the processing of events from node
> 1.
>
> It's not reporting any duration information for the other nodes,
> since for them it's not doing the work of:
>
>  - searching for the scope of the SYNC group,
>  - opening a cursor to pull the relevant log data for that scope, and
>  - applying the changes to the subscriber.
>
> The processing of the 290 groups of 3 events is taking place
> concurrently with the Real Work, in separate threads, so it's not
> consuming any time that's worth measuring.
>
> Over lunch, Jan and I had a chat about this; it looks like we don't
> report quite comprehensive enough information in the logs to make it
> easy to interpret what parts of SYNC processing are consuming what
> time.
>
> The "straw man" idea we came up with is to do a much better breakdown
> of the time, in particular, to record:
>
>  - time spent in pqexec() against the provider, broken down into...
>     - time spent processing what transactions are part of the SYNC group
>     - time spent processing the LOG cursor
>  - time spent in pqexec() against the subscriber (the I/U/D phase)
>  - numbers of pqexecs()
>     against provider
>     against subscriber
>  - possibly, the number of times we grab timestamps
>
> This could be costly to record (since it will call gettimeofday()
> quite a lot of times), so we'd add in an option as to whether to do
> this analysis or not.
>
> This would put more useful information in the logs, specifically, it
> would make it easier to see where the bottleneck really lies.
>
> For instance, if the log indicates that 297s are being spent in
> pqexec() on the provider, and only 4s on pqexec() on the subscriber,
> that makes it pretty clear that there's some problem with the queries
> searching for data to replicate.  
>
> Reverse the figures, and it becomes clear that, for some reason, the
> subscriber is very slow in accepting updates.
>   
I think that would really be a great improvement for log analysis.
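If I understand the idea, it would amount to something like the rough
sketch below (hypothetical helper and variable names, not the actual
slon code), with the accumulated figures written to the log at the end
of each SYNC:

    /* Illustrative only -- a hypothetical wrapper, not the real slon code.
     * Shows the kind of per-call timing described above: wrap each
     * PQexec() against the provider (or subscriber) and accumulate the
     * elapsed time and call count. */
    #include <sys/time.h>
    #include <libpq-fe.h>

    static double total_provider_secs = 0.0;  /* accumulated PQexec time */
    static long   provider_exec_count = 0;    /* number of PQexec calls  */

    static PGresult *
    timed_exec(PGconn *conn, const char *query)
    {
        struct timeval t0, t1;
        PGresult      *res;

        gettimeofday(&t0, NULL);
        res = PQexec(conn, query);
        gettimeofday(&t1, NULL);

        total_provider_secs += (t1.tv_sec - t0.tv_sec)
                             + (t1.tv_usec - t0.tv_usec) / 1000000.0;
        provider_exec_count++;

        return res;
    }
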
> In either case, that is helpful in suggesting where to look for
> improvements.  To be sure, we'd not focus on the place with 4s; we'd
> focus on the 297s!
>
> Improvements welcome; I'll probably put off starting to implement
> anything until next week.
>   
For now, I'll try to enable the log_duration and log_statement
parameters in the postgresql.conf file to investigate.
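Something along these lines, just as a first guess (to be tuned
depending on the log volume it produces):

    # postgresql.conf
    log_statement = 'all'   # log every statement slon sends
    log_duration = on       # log the duration of each completed statement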

Thanks


-- 
Cyril SCETBON

