[Slony1-general] high load on all nodes

Wed Aug 16 09:43:40 PDT 2006

Thanks for the suggestions!

I've determined that it's the second idea.  Running killall -STOP slon
(I didn't send a sig 2) brings the load down to about 0.8-1.2 every
time I do it.  Then, when I run killall -CONT slon, the load climbs back
up to about 6.00.
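
For anyone who wants to repeat the test, here is a rough sketch of the
comparison as a pair of shell functions.  The names load1 and compare_load
and the 180-second settle time are my own assumptions, not anything that
ships with Slony.  Replication stalls while the slons are stopped, so don't
leave them paused longer than needed.

```shell
# Extract the 1-minute load average from a line of `uptime` output on stdin.
# Handles both "load average:" (Linux) and "load averages:" (BSD/macOS).
load1() {
    sed -e 's/.*load averages\{0,1\}: *//' -e 's/[, ].*//'
}

# Print the load with the slons running, pause them, wait for the load
# average to settle, print it again, then resume them.
# $1 = seconds to wait while paused (assumed default: 180)
compare_load() {
    settle="${1:-180}"
    echo "running: $(uptime | load1)"
    killall -STOP slon
    sleep "$settle"
    echo "paused:  $(uptime | load1)"
    killall -CONT slon
}
```

Call it as compare_load, or compare_load 60 for a shorter pause.  The
1-minute figure lags, so a very short pause will understate the drop.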

Guess I need a bigger machine?

--Richard

On Aug 10, 2006, at 8:53 PM, cbbrowne at ca.afilias.info wrote:

>> Hi all,
>>
>> I'm running a master with two slaves, and one of the slaves forwards
>> to an offsite node (so, 4 nodes in all).  Each node is replicating 3
>> databases.
>>
>> Since finishing the setup, I've been experiencing high load on all 4
>> nodes, and I'm not sure what's causing the problem.  A quick glance
>> at 'top' shows that the top cpu-consuming processes are postgres
>> processes run by the slon daemon.  A handful of them say "notify
>> interrupt waiting".
>>
>> Any ideas why the load is so high (anywhere from 4.00 to 6.00)?
>> Before doing this setup, I experienced loads of around 0.80 to 1.70.
>>
>> Any insight would be greatly appreciated!
>> --Richard
>
> With a large number of processes waiting, that explains the apparent high
> load average.  They don't have to be working terribly hard for you to have
> a big queue of waiting processes.
>
> The one thing that leaps to mind as a plausible cause for them to be working
> hard would be if the table pg_listener has grown to significant size, and
> the notification system is waiting on it.
>
> You might try running, on various of the databases:
>   VACUUM FULL VERBOSE pg_catalog.pg_listener;
>
> If that has any evident effects, such as reporting that the table shrank
> from thousands of pages in size to near nothing, then that suggests
> undervacuuming of pg_listener as a direct cause of the problem.
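
A cheaper first look, before running the VACUUM FULL, might be to read the
page count that pg_class records for the table.  This is only a sketch: the
function names are made up for illustration, the 8192-byte block size is the
usual default but is an assumption about your build, and relpages is an
estimate that is only refreshed by VACUUM or ANALYZE.

```shell
# Convert a page count to kilobytes.
# $1 = number of pages, $2 = block size in bytes (assumed default: 8192)
pages_to_kb() {
    echo $(( $1 * ${2:-8192} / 1024 ))
}

# Print the recorded page count of pg_listener in one database.
# $1 = database name; connection settings come from the usual PG* variables.
listener_pages() {
    psql -At -d "$1" -c \
        "SELECT relpages FROM pg_catalog.pg_class
          WHERE oid = 'pg_catalog.pg_listener'::regclass;"
}
```

For example, pages_to_kb "$(listener_pages mydb)" prints the approximate size
in kB, where mydb stands in for one of the replicated databases.  A healthy
pg_listener should be a handful of pages at most.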
>
> Second thought...  Consider stopping (sig 2, initially, to allow DB
> connections to be closed as cleanly as possible) and restarting all the
> slon processes, perhaps followed by vacuuming pg_listener...  Recycling
> the database connections (which is the result of this) would be expected
> to clear notification/listen activity....
>
> Third thought / suspicion...  If you shut down the slons for a few
> seconds, check to see if any DB connections remain.  It's possible that
> your problem is that the network is a bit unreliable, and you're
> experiencing "zombied" connections.  That is, a remote connection falls
> over but the PostgreSQL back end doesn't become aware of this for up to
> about 2 hours.
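
One way to do that check, sketched under some assumptions: with the slons
shut down, count the backends that are still attached to a replicated
database by reading the process titles from ps.  The function name is
hypothetical, and it assumes the backend title has the form
"postgres: user database host activity", which may differ on your platform
or packaging.

```shell
# Count PostgreSQL backends attached to a given database.
# Reads `ps` command lines on stdin; $1 = database name.
count_backends() {
    awk -v db="$1" '
        $1 == "postgres:" && $3 == db { n++ }   # fields: "postgres:" user database ...
        END { print n + 0 }                      # print 0 when nothing matched
    '
}
```

Usage would be something like ps -eo args | count_backends mydb, run on each
node after the slons are stopped (mydb is a placeholder).  Anything left over
that isn't one of your own sessions is a candidate zombie.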
>
> In that case, you're left with a barrel of basically useless connections,
> possibly thinking they're in a transaction.  Kill off those backends
> (signal -2), restart slons, and that should clear things up, at least
> until the network flakes again...
>
> I think I most anticipate it's #3.  Checking them in order should be easy
> enough, and the earlier steps won't preclude taking later ones...
>
> If you have a bunch of zombied connections, shutting off the "live" slons
> won't touch them...