Len Walter len.walter
Tue Jul 26 03:30:00 PDT 2005
G'day,

I've submitted a bug at
http://gborg.postgresql.org/project/slony1/bugs/bugupdate.php?1383.

The environment is RHEL4ES, Postgres 7.4.6 from distro (rpm name
postgresql-7.4.6-1.RHEL4.1), installed on one master, four slaves,
slony1.1.0 compiled from src->rpm.

Our four slave slony daemons aren't releasing the db connections to
the master, and keep opening new ones, rapidly leading to "too many
clients" errors on the master, which makes replication and all other
db access fail.

It seems to run fine for about twelve hours, then connections start
building up. After about ten hours, we end up with many connections
all in "async_notify waiting" (from ps -efa on the master), holding
locks as follows:

freeside=# select mode,pid,relname,usename,current_query,query_start
from pg_locks, pg_class,pg_stat_activity where pg_class.oid = relation
and pid=procpid and mode like 'Exclusive%' order by query_start;
mode | pid | relname | usename | current_query | query_start
---------------+-------+-------------+----------+---------------------+-------------------------------
ExclusiveLock | 23639 | pg_listener | postgres | <IDLE> | 2005-07-26
01:49:10.760922+10
ExclusiveLock | 3997 | pg_listener | postgres | <IDLE> | 2005-07-26
01:53:09.634105+10
ExclusiveLock | 23978 | pg_listener | postgres | <IDLE> | 2005-07-26
01:54:10.626429+10
ExclusiveLock | 22851 | pg_listener | postgres | <IDLE> | 2005-07-26
01:57:41.232085+10
ExclusiveLock | 24546 | pg_listener | postgres | <IDLE> | 2005-07-26
01:58:33.698555+10
ExclusiveLock | 3961 | pg_listener | postgres | <IDLE> | 2005-07-26
01:58:48.898225+10
ExclusiveLock | 24463 | pg_listener | postgres | <IDLE> | 2005-07-26
01:59:11.731+10
ExclusiveLock | 24574 | pg_listener | postgres | <IDLE> | 2005-07-26
01:59:40.865348+10
ExclusiveLock | 24632 | pg_listener | postgres | <IDLE> | 2005-07-26
02:03:11.090778+10
ExclusiveLock | 24651 | pg_listener | postgres | <IDLE> | 2005-07-26
02:04:06.607669+10
ExclusiveLock | 24670 | pg_listener | postgres | <IDLE> | 2005-07-26
02:04:45.562575+10
ExclusiveLock | 24680 | pg_listener | postgres | <IDLE> | 2005-07-26
02:05:17.364743+10
...
ExclusiveLock | 4044 | pg_listener | postgres | commit transaction; |
2005-07-26 10:37:47.625651+10
(93 rows)

I have DEBUG2 logging from all slony daemons. Here are some of the
connections made by one of the slaves - nb there are 24 conns since
starting and 24 conns * 4 slaves =~ 100 connections, the postgres
master's limit.

2005-07-25 11:01:50 EST DEBUG1 remoteListenThread_1: connected to
'host=vispcentral01-syd dbname=freeside user=postgres port=8477
password=password'
2005-07-25 19:50:42 EST DEBUG1 remoteListenThread_1: connected to
'host=vispcentral01-syd dbname=freeside user=postgres port=8477
password=password'
2005-07-26 00:26:23 EST DEBUG1 remoteListenThread_1: connected to
'host=vispcentral01-syd dbname=freeside user=postgres port=8477
password=password'
2005-07-26 01:06:44 EST DEBUG1 remoteListenThread_1: connected to
'host=vispcentral01-syd dbname=freeside user=postgres port=8477
password=password'
2005-07-26 01:52:05 EST DEBUG1 remoteListenThread_1: connected to
'host=vispcentral01-syd dbname=freeside user=postgres port=8477
password=password'
2005-07-26 02:02:15 EST DEBUG1 remoteListenThread_1: connected to
'host=vispcentral01-syd dbname=freeside user=postgres port=8477
password=password'
2005-07-26 02:08:00 EST DEBUG1 remoteListenThread_1: connected to
'host=vispcentral01-syd dbname=freeside user=postgres port=8477
password=password'
..until it reaches 24 conns and starts failing because, with the other
slaves doing the same, the master server is full.

The master postgres server log, starting at about the same time as the
conns started building up. The conns below are from slave dbs.
2005-07-25 19:27:42 LOG: 00000: connection received:
host=vispaaa02-syd port=52005
LOCATION: BackendFork, postmaster.c:2395
2005-07-25 19:27:42 LOG: 00000: connection authorized: user=postgres
database=freeside
LOCATION: BackendFork, postmaster.c:2457
2005-07-25 19:42:28 LOG: 00000: recycled transaction log file "000000030000008D"
LOCATION: MoveOfflineLogs, xlog.c:1656
2005-07-25 19:52:55 LOG: 08006: could not send data to client:
Connection reset by peer
LOCATION: internal_flush, pqcomm.c:1000
2005-07-25 19:52:58 LOG: 08006: could not send data to client: Broken pipe
LOCATION: internal_flush, pqcomm.c:1000
2005-07-25 19:53:01 LOG: 00000: connection received:
host=vispaaa01-syd port=47697

It looks like the slave daemons should be releasing the connections to
the master, but aren't. Is that right?

thanks in advance,
Len

-- 
Len Walter len.walter at gmail.com http://crookedtimbre.net


More information about the Slony1-general mailing list