joe joe
Mon Jan 10 20:09:09 PST 2005
Greetings all, sorry for the long post, but hopefully you can shed some
light on my dilemma here:
	Last Friday I started replication on our production servers after an
extensive testing period on some test servers. Everything seemed to be
fine before I left for the weekend: the data set had apparently finished
copying over to the subscriber nodes, and things seemed to be humming
along just fine.
	I came into work this morning and found the master db server to be dog
slow (the load average had been above 10 since sometime Saturday).  After
a bit of investigation I think the Slony replication daemons are causing
it.

postgres 19983 22.8  3.4 83488 69508 ?       S    Jan07 1140:0 postgres:
postgres pl ::ffff:216.239.10.115 FETCH
postgres 20073 23.5  3.4 83480 69524 ?       S    Jan07 1169:57
postgres: postgres pl ::ffff:216.239.10.116 FETCH

From postgres I found that those 2 processes are running the following
queries:

19983 | fetch 100 from LOG;
20073 | fetch 100 from LOG; 
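(For reference, one way to map a backend PID to its current query on 7.4
is pg_stat_activity; this is a sketch and assumes stats_command_string is
enabled in postgresql.conf, otherwise current_query shows up empty:)

```sql
-- Map backend PIDs to their running queries.
-- 7.4 uses procpid/current_query (later releases renamed these columns).
SELECT procpid, current_query
  FROM pg_stat_activity
 WHERE procpid IN (19983, 20073);
```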

Which I'm guessing is Slony fetching rows from its replication log tables
to push to the subscribers.  So I figured I'd check to see how many
tuples were in the log.

pl=# select reltuples::int from pg_class where relname='sl_log_1';
 reltuples 
-----------
  16945348
(1 row)

pl=# select reltuples::int from pg_class where relname='sl_log_2';
 reltuples 
-----------
         0
(1 row)
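(Worth noting: reltuples is only the planner's estimate from the last
VACUUM/ANALYZE, so it can be stale; an exact count, if you can afford the
scan, would be something like:)

```sql
-- Exact count; this seq-scans sl_log_1, so expect it to be slow
-- on a ~17M-row table.
SELECT count(*) FROM sl_log_1;
```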

To confirm that replication is falling behind, I checked one of the
replicated tables on the slaves to see if it was in sync with the master:

Master: pl=# select activity_id from pl02_activity_table order by
activity_id desc limit 1;
 activity_id 
-------------
    13835041
(1 row)

Slave1: pl=# select activity_id from pl02_activity_table order by
activity_id desc limit 1;
 activity_id 
-------------
    13818012
(1 row)

Slave2: pl=# select activity_id from pl02_activity_table order by
activity_id desc limit 1;
 activity_id 
-------------
    13818008
(1 row)
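In case it helps anyone else checking lag: rather than eyeballing max
IDs, Slony keeps lag info in its sl_status view in the cluster schema.
A sketch, assuming a cluster named "pl_cluster" (substitute your own
cluster name) and that the view and its columns are present in your
1.0.x install:

```sql
-- Replication lag as seen from the master node.
-- "_pl_cluster" is a placeholder for your Slony cluster schema.
SELECT st_received, st_lag_num_events, st_lag_time
  FROM "_pl_cluster".sl_status;
```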

Software versions:
Slony-I 1.0.5
PostgreSQL 7.4.6
MasterDB: 3.8 Ghz Pentium 4
	  6GB of RAM 
	  5 36GB 10K SCSI in HW RAID5
	  RedHat 9
Slave1 & Slave2:  Dual Opteron 246
	  8 GB Ram
	  6 73GB 15k FC drives in HW RAID 1+0 (512 MB battery-backed cache on
controller)
	  SLES 8 for AMD64

DB size on Master: 78 GB
	   on Slave1: 74 GB
	   on Slave2: 69 GB

Are things removed from the log once they're replicated, or do they stay
in there for a while?

Am I correct in guessing that for some reason it's attempting to do a seq
scan on the sl_log_1 table to look for new rows to replicate, and that's
why those processes are taking forever?
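One way to check the seq-scan theory is to EXPLAIN a selective query over
the same table.  The filter below is illustrative only (it is not
Slony's exact cursor query, just a stand-in against sl_log_1):

```sql
-- See what plan a selective query on sl_log_1 gets; a Seq Scan here
-- on a 17M-row table would explain the slow FETCHes.
EXPLAIN
SELECT log_origin, log_xid, log_tableid, log_actionseq
  FROM sl_log_1
 WHERE log_origin = 1;
```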

Is this expected, or is it possible that I goofed something during the
install?

If more information is needed, let me know and I'll do my best to provide
it.

-Joe Markwardt
