Christopher Browne cbbrowne at ca.afilias.info
Fri Jul 11 11:06:50 PDT 2008
Cyril SCETBON <scetbon at echo.fr> writes:
>> It does not seem to hurt performance, as you said, but do you see
>> anything else ?
> If I check the different grouping size I see :
>
> count    size
>
>     1     255
>     1     763
>     2       7
>   290       3
>
> If I understand correctly, you both mean that 301.017 seconds is the
> duration of processing the group of 255 events from node 1, and not
> the 290 groups of 3 events ?

Right, 301.017s is the duration of the processing of events from node
1.

It's not reporting any duration information for the other nodes in
that it's not doing the work of:

 - searching for the scope of the SYNC group,
 - opening a cursor to pull the relevant log data for that scope, and
 - applying the changes to the subscriber.

The processing of the 290 groups of 3 events is taking place
concurrently with the Real Work, in separate threads, so it's not
consuming any time that's worth measuring.
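
To make the shape of that work concrete, here's a minimal sketch of the
per-group flow, assuming one connection to the provider and one to the
subscriber.  The SQL strings are placeholders, not the statements slon
actually issues; the point is just which connection each phase talks to.

/*
 * Hypothetical sketch of the per-SYNC-group flow; placeholder SQL only.
 */
#include <libpq-fe.h>

static void
process_sync_group(PGconn *provider, PGconn *subscriber)
{
    /* 1. Determine the scope of the SYNC group (provider side). */
    PGresult *scope = PQexec(provider,
        "SELECT 1 AS placeholder_sync_scope");
    PQclear(scope);

    /* 2. Pull the relevant log data for that scope (provider side). */
    PGresult *logrows = PQexec(provider,
        "SELECT 1 AS placeholder_log_rows");
    PQclear(logrows);

    /* 3. Apply the changes to the subscriber (the I/U/D phase). */
    PGresult *apply = PQexec(subscriber,
        "SELECT 1 AS placeholder_apply_iud");
    PQclear(apply);
}

The timing idea below is essentially about putting a stopwatch around
each of those three calls.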

Over lunch, Jan and I had a chat about this; it looks like we don't
report comprehensive enough information in the logs to make it easy to
tell which parts of SYNC processing are consuming the time.

The "straw man" idea we came up with is to do a much better breakdown
of the time, in particular, to record (a rough sketch follows below):

 - time spent in PQexec() against the provider, broken down into...
    - time spent determining which transactions are part of the SYNC group
    - time spent processing the LOG cursor
 - time spent in PQexec() against the subscriber (the I/U/D phase)
 - number of PQexec() calls...
    - against the provider
    - against the subscriber
 - possibly, the number of times we grab timestamps

This could be costly to record (since it would call gettimeofday()
quite a lot of times), so we'd add an option controlling whether to do
this analysis.
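
To make the straw man concrete, here's a minimal sketch of what that
accounting could look like, assuming every PQexec() on the two
connections gets routed through a small wrapper.  All of the names
(sync_phase, phase_stats, timed_exec, enable_sync_timing) are invented
for illustration; none of them are existing slon symbols or options.

#include <stddef.h>
#include <sys/time.h>
#include <libpq-fe.h>

typedef enum {
    PHASE_SYNC_SCOPE,      /* working out what is in the SYNC group   */
    PHASE_LOG_CURSOR,      /* pulling rows from the LOG cursor        */
    PHASE_SUBSCRIBER_IUD,  /* the I/U/D phase against the subscriber  */
    PHASE_MAX
} sync_phase;

typedef struct {
    double seconds;        /* accumulated wall-clock time in PQexec() */
    long   calls;          /* number of PQexec() calls                */
} phase_stats;

static phase_stats stats[PHASE_MAX];
static int enable_sync_timing = 1;   /* the proposed on/off option    */

/* Wrap PQexec(), charging elapsed wall-clock time to the given phase. */
static PGresult *
timed_exec(PGconn *conn, const char *query, sync_phase phase)
{
    struct timeval t0, t1;
    PGresult   *res;

    if (!enable_sync_timing)
        return PQexec(conn, query);

    gettimeofday(&t0, NULL);
    res = PQexec(conn, query);
    gettimeofday(&t1, NULL);

    stats[phase].seconds += (t1.tv_sec - t0.tv_sec)
                          + (t1.tv_usec - t0.tv_usec) / 1000000.0;
    stats[phase].calls++;

    return res;
}

Dumping stats[] at the end of each SYNC would then give exactly the
provider-versus-subscriber split used in the example below, plus the
call counts, at the cost of two gettimeofday() calls per query.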

This would put more useful information in the logs; specifically, it
would make it easier to see where the bottleneck really lies.

For instance, if the log indicates that 297s are being spent in
PQexec() on the provider, and only 4s in PQexec() on the subscriber,
that makes it pretty clear that there's some problem with the queries
searching for data to replicate.

Reverse the figures, and it becomes clear that, for some reason, the
subscriber is very slow in accepting updates.

In either case, that is helpful in suggesting where to look for
improvements.  To be sure, we'd not focus on the place with 4s; we'd
focus on the 297s!

Improvements welcome; I'll probably put off starting to implement
anything until next week.
-- 
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://cbbrowne.com/info/lsf.html
Rules  of the  Evil Overlord  #145. "My  dungeon cell  decor  will not
feature exposed pipes.  While they add to the  gloomy atmosphere, they
are good  conductors of vibrations and  a lot of  prisoners know Morse
code." <http://www.eviloverlord.com/>

