bugzilla-daemon at main.slony.info
Mon Jun 21 10:14:40 PDT 2010
http://www.slony.info/bugzilla/show_bug.cgi?id=135

           Summary: slon can get stuck waiting for a sl_config_lock it
                    already has
           Product: Slony-I
           Version: 2.0
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: low
         Component: slon
        AssignedTo: slony1-bugs at lists.slony.info
        ReportedBy: ssinger at ca.afilias.info
                CC: slony1-bugs at lists.slony.info
   Estimated Hours: 0.0


I am not exactly sure what is going on here, but this is what I have observed.

As part of a test (FailNodeTest.js), a subscription process seems to have hung.

select pid,relname, mode, granted from pg_locks l, pg_class c where
l.relation=c.oid;
  pid  |          relname           |        mode         | granted 
-------+----------------------------+---------------------+---------
 13197 | sl_config_lock             | AccessExclusiveLock | t
 13378 | pg_class                   | AccessShareLock     | t
 13206 | sl_config_lock             | AccessExclusiveLock | f
 13378 | pg_class_relname_nsp_index | AccessShareLock     | t
 13378 | pg_locks                   | AccessShareLock     | t
 13378 | pg_class_oid_index         | AccessShareLock     | t


Note that pid 13197 holds the lock, while pid 13206 is waiting for it.
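The same pg_locks view can be narrowed to pair each waiter with its holder directly. A hedged sketch against the stock PostgreSQL catalogs (standard pg_locks/pg_class columns only, nothing Slony-specific):

```sql
-- Pair each ungranted lock request with the session that holds
-- a granted lock on the same relation.
SELECT w.pid  AS waiting_pid,
       h.pid  AS holding_pid,
       c.relname,
       w.mode AS waiting_mode
  FROM pg_locks w
  JOIN pg_locks h ON h.relation = w.relation AND h.granted
  JOIN pg_class c ON c.oid = w.relation
 WHERE NOT w.granted;
```

On the output above this should report waiting_pid 13206 and holding_pid 13197 for sl_config_lock.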

ps aux |grep 13197
9994     13197  0.0  0.3  62020 10232 ?        Ss   12:45   0:00 postgres:
slony test4 ::1(50428) idle in transaction
9994     13384  0.0  0.0   7336   880 pts/0    R+   12:55   0:00 grep 13197


ssinger.local at ssinger:~/src/clustertest/clustertest$ lsof |grep 50428
slon      13137 ssinger.local    6u     IPv6            4391222      0t0       
TCP localhost:50428->localhost:postgresql (ESTABLISHED)
postgres  13197 ssinger.local    7u     IPv6            4391254      0t0       
TCP localhost:postgresql->localhost:50428 (ESTABLISHED)


So process 13137, which is the slon, holds the lock


but,

ps aux |grep 13206
9994     13206  0.0  0.2  60740  8192 ?        Ss   12:45   0:00 postgres:
slony test4 ::1(50432) LOCK TABLE waiting
9994     13399  0.0  0.0   7340   900 pts/0    S+   12:56   0:00 grep 13206
ssinger.local at ssinger:~/src/clustertest/clustertest$ lsof |grep 50432
slon      13137 ssinger.local   10u     IPv6            4391239      0t0       
TCP localhost:50432->localhost:postgresql (ESTABLISHED)
postgres  13206 ssinger.local    7u     IPv6            4391278      0t0       
TCP localhost:postgresql->localhost:50432 (ESTABLISHED)
ssinger.local at ssinger:~/src/clustertest/clustertest$ ps aux |grep 13137
9994     13137  0.0  0.0 160664  1448 pts/0    Sl   12:45   0:00
/usr/local/pgsql_cvs/bin/slon disorder_replica -d4 dbname=test4 host=localhost
user=slony password=slony

This shows that process 13137 is also connected to a second backend, which is
waiting for the sl_config_lock.

In the logs I observe:

2010-06-21 12:46:16,425 [db4 stdout] DEBUG
info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-21
12:46:16 EDTDEBUG2 remoteWorkerThread_3: Received event #3 from 5000000007
type:SUBSCRIBE_SET
2010-06-21 12:46:16,425 [db4 stdout] DEBUG
info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-21
12:46:16 EDTWARN   remoteWorkerThread_1: copy set: data provider 3 only on sync
5000000036 - sleep 5 seconds



That log message from remoteWorkerThread_1 keeps repeating.


I wonder if remoteWorkerThread_1 is 'sleeping' while holding a lock that
prevents the SYNC from being processed.
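The pattern I suspect can be sketched as follows. This is a hypothetical Python analogue, not slon's actual C code; the thread roles, timings, and names are illustrative only:

```python
import threading
import time

# One worker sleeps in a retry loop while holding the config lock;
# the worker that could satisfy the retry condition needs that same
# lock first, so the retry condition can never become true.

sync_arrived = threading.Event()   # stands in for "provider reached a new SYNC"
config_lock = threading.Lock()     # stands in for LOCK TABLE sl_config_lock
result = {}

def copy_set_worker():
    # remoteWorkerThread_1 analogue: grabs the config lock, then
    # sleeps in a loop waiting for a newer SYNC.
    with config_lock:
        for _ in range(5):                 # bounded so this demo terminates
            if sync_arrived.is_set():
                result["copy"] = "subscribed"
                return
            time.sleep(0.05)               # the real slon sleeps 5 seconds
        result["copy"] = "stuck"

def sync_worker():
    # remoteWorkerThread_3 analogue: must take the same lock before
    # it can record the SYNC the first worker is waiting for.
    if config_lock.acquire(timeout=0.1):   # gives up instead of hanging forever
        sync_arrived.set()
        config_lock.release()
        result["sync"] = "recorded"
    else:
        result["sync"] = "blocked"

t1 = threading.Thread(target=copy_set_worker)
t2 = threading.Thread(target=sync_worker)
t1.start()
time.sleep(0.02)                           # let t1 take the lock first
t2.start()
t1.join()
t2.join()
print(result)
```

With the lock held for the whole retry loop, the SYNC recorder times out and the copy worker ends up "stuck"; the real slon, with no bound on the retries and no acquire timeout, would simply repeat the "sleep 5 seconds" message forever, which matches the log above.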


The cluster configuration is

1===>3====>4

where a path ALSO exists between nodes 1 and 4.  (There is also a node 5
receiving from 3, and a node 2 receiving from 1.)
