Mon Sep 28 09:36:38 PDT 2009
- Previous message: [Slony1-general] Advice/Recommendations on improving Slony performance
- Next message: [Slony1-general] Slony-I, slave node can't get data from master after network fail
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
In response to Jason Buberel &lt;jason at buberel.org&gt;:

> I have Slony (1.2.x) running on our production cluster, where it has been
> running for a few years now. The flow of data looks like this:
>
> master -|----> slave1
>         |----> slave2
>         |----> slave3
>         ...
>         |----> slave9
>
> Data always flows from master to each identically configured slave. There is
> never any data replication between slaves or from slaves to master. All
> one-way, all down-stream, all the time.
>
> As the load on our service has grown, I've begun to see node lag counts
> start to grow when machines get busy. Based on my understanding of the
> documentation when the cluster was originally configured, I added storage
> paths from every node to every other node (master --> slave1-9, slave1 -->
> master + slave2-9, etc.).
>
> Question #1: If my flow of data is always/only from master to each slave,
> can I remove the storage paths between slaves, leaving me only with master
> --> slaveN and slaveN --> master? Would this decrease the communication
> overhead or be beneficial in general?

I doubt it would help your performance any, and it will cripple you if you
ever need to do a switchover.

> Question #2: When initially configured, the DSN connection strings used to
> define each node used IP addresses that were part of a 1Gbit network. Since
> then, each of these machines has had an additional 10Gbit network connection
> added to it. Would it be safe to stop the slon daemons, manually update the
> sl_path.pa_conninfo column values to use the IP addresses of these new
> network interfaces, then restart the slon daemons?

That's the wrong approach. Simply use slonik to redefine the paths with
STORE PATH and they will be updated. You won't have to restart anything.

> Question #3: Are there other configuration parameters I can use to improve
> the overall performance of the cluster?
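A minimal slonik sketch of the STORE PATH approach described above. The cluster name, node IDs, and the 10Gbit addresses here are hypothetical placeholders; substitute your own:

```
# Re-issuing STORE PATH for an existing server/client pair updates
# sl_path.pa_conninfo; the running slon daemons pick up the new
# conninfo without a restart.
cluster name = mycluster;

node 1 admin conninfo = 'dbname=app host=10.0.0.1 user=slony';
node 2 admin conninfo = 'dbname=app host=10.0.0.2 user=slony';

# Point node 2's connection to node 1 at the 10Gbit interface.
store path (server = 1, client = 2,
            conninfo = 'dbname=app host=10.0.0.1 user=slony');

# And the reverse path, so node 1 reaches node 2 the same way.
store path (server = 2, client = 1,
            conninfo = 'dbname=app host=10.0.0.2 user=slony');
```

You would repeat the pair of STORE PATH statements for each slave node in the cluster.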
It doesn't sound like you've identified the bottleneck yet; you're assuming
that network traffic is the bottleneck. If you're right, then #2 will help.
If you're wrong, then it's CPU or disk. If I were a betting man, I'd put my
money on disk I/O as the bottleneck, as that's usually the case. If I'm
right, then you have a few other options:

1) Buy faster CPUs/disks for the servers.

2) Add an additional slave to the mix to take over some of the chore of
replicating from the master. Keep in mind that the master has to manage
replicating data to all the slaves, in addition to whatever work it's doing
for the clients of the database. If you change your layout to:

master --> slave0 -|----> slave1
                   |----> slave2
                   |----> slave3
                   ...
                   |----> slave9

you've cut the master's replication fan-out from nine subscribers down to
one. Since slave0 is doing nothing _but_ replicating, it should be able to
keep up better than the master does now. Actually, depending on what your
switching fabric looks like, that change might improve the situation even
if the problem is network related.

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/
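The cascaded layout above could be expressed in slonik by subscribing the leaf slaves with slave0 as their provider. A rough sketch; the node IDs, set ID, and conninfo strings are hypothetical:

```
# Node 1 = master, node 10 = slave0 (the relay), nodes 2-9 = leaf slaves.
cluster name = mycluster;

node 1  admin conninfo = 'dbname=app host=10.0.0.1 user=slony';
node 10 admin conninfo = 'dbname=app host=10.0.0.10 user=slony';
node 2  admin conninfo = 'dbname=app host=10.0.0.2 user=slony';

# slave0 subscribes directly to the master; forward = yes lets it
# act as a provider for the nodes downstream of it.
subscribe set (id = 1, provider = 1, receiver = 10, forward = yes);

# Each leaf slave then subscribes with slave0 as its provider.
subscribe set (id = 1, provider = 10, receiver = 2, forward = no);
# ...repeat for receivers 3 through 9.
```

With this in place the master pushes each change once, to slave0, and slave0 fans it out to the remaining slaves.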
More information about the Slony1-general mailing list