Thu Aug 10 20:53:55 PDT 2006
- Previous message: [Slony1-general] high load on all nodes
- Next message: [Slony1-general] high load on all nodes
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> Hi all,
>
> I'm running a master with two slaves, and one of the slaves forwards
> to an offsite node (so, 4 nodes in all). Each node is replicating 3
> databases.
>
> Since finishing the setup, I've been experiencing high load on all 4
> nodes, and I'm not sure what's causing the problem. A quick glance
> at 'top' shows that the top cpu-consuming processes are postgres
> processes run by the slon daemon. A handful of them say "notify
> interrupt waiting".
>
> Any ideas why the load is so high (anywhere from 4.00 to 6.00)?
> Before doing this setup, I experienced loads around 0.80 to 1.70.
>
> Any insight would be greatly appreciated!
> --Richard

With a large number of processes waiting, that explains the apparent high load average: they don't have to be working terribly hard for you to have a big queue of waiting processes.

The one thing that leaps to mind as a plausible cause for them working hard is that the table pg_listener has grown to a significant size, and the notification system is waiting on it. You might try running, on each of the databases:

  VACUUM FULL VERBOSE pg_catalog.pg_listener;

If that has any evident effect, such as reporting that the table shrank from thousands of pages to nearly nothing, that suggests under-vacuuming of pg_listener as a direct cause of the problem.

Second thought: consider stopping all the slon processes (SIGINT, i.e. "kill -2", initially, to allow the DB connections to be closed as cleanly as possible) and restarting them, perhaps followed by vacuuming pg_listener. Recycling the database connections (which is what this does) would be expected to clear out notification/listen activity.

Third thought / suspicion: if you shut down the slons for a few seconds, check whether any DB connections remain. It's possible that your network is a bit unreliable, and you're experiencing "zombied" connections.
That is, a remote connection falls over, but the PostgreSQL back end doesn't become aware of it for up to about two hours. In that case, you're left with a barrel of basically useless connections, possibly still thinking they're in a transaction. Kill off those backends (SIGINT again), restart the slons, and that should clear things up, at least until the network flakes again...

I most suspect it's #3. Checking them in order should be easy enough, and the earlier steps won't preclude taking the later ones... If you have a bunch of zombied connections, shutting off the "live" slons won't touch them.
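For what it's worth, the "zombied connection" check can be sketched as a filter over pg_stat_activity-style rows: a backend whose current query is "<IDLE> in transaction" and which has sat that way past some threshold is a candidate for killing. This is only an illustrative sketch, not anything from the thread — the row shape, the two-hour threshold, and the helper name are all assumptions:

```python
from datetime import datetime, timedelta

def find_zombied_backends(rows, now, max_idle=timedelta(hours=2)):
    """Return pids of backends idle in a transaction longer than max_idle.

    rows: (pid, current_query, started) tuples, roughly the shape of
    pre-9.2 pg_stat_activity output. Purely illustrative.
    """
    zombies = []
    for pid, query, started in rows:
        # Pre-9.2 servers report a stuck transaction as "<IDLE> in transaction"
        if query == "<IDLE> in transaction" and now - started > max_idle:
            zombies.append(pid)
    return zombies

now = datetime(2006, 8, 10, 20, 0)
rows = [
    (4301, "<IDLE> in transaction", datetime(2006, 8, 10, 17, 0)),   # stale: 3h
    (4302, "<IDLE>", datetime(2006, 8, 10, 17, 0)),                  # merely idle
    (4303, "<IDLE> in transaction", datetime(2006, 8, 10, 19, 30)),  # recent
]
print(find_zombied_backends(rows, now))  # -> [4301]
```

On the server itself, the equivalent would be querying pg_stat_activity after stopping the slons and sending kill -INT to any backend pids that remain.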