Bill Moran wmoran at potentialtech.com
Fri Feb 27 05:58:13 PST 2009
I'm seeing a problem similar to that described in this thread:
http://lists.slony.info/pipermail/slony1-general/2008-August/008492.html

However, I was unable to isolate the cause.  The solution was to restart
the slons on all nodes, which unstuck things and all tests show that
the cluster is replicating happily again.

I'm trying to determine how/why the problem happened.  In this case it
was triggered by moving an origin around (as part of a routine disaster
drill).  The set moved to the backup without problem, but we didn't notice
that the slon on node 2 had died until we went to move things back.  At
that point, the move set hung, and eventually bailed with "timeout exceeded
while waiting for event confirmation"

If anyone has any insight into this, I'm all ears.  Otherwise, I can set
up a test bed if there's anything I can do to provide more information.

Some basic details:
FreeBSD 6.3
PostgreSQL 8.3.5
Slony 1.2.15

Cluster =
node 1: masterdb in facil0
node 2: localslavedb in facil0
node 3: remoteslavedb in facil1

2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8393-1] [55583] CONFIG moveSet: set_id=1 old_origin=1 new_origin=3
2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8394-1] [55583] DEBUG1 remoteWorkerThread_3: helper thread for provider 3 created
2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8395-1] [55583] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8396-1] [55583] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=3
2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8397-1] [55583] CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=3
2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8398-1] [55583] DEBUG1 remoteWorkerThread_1: helper thread for provider 1 terminated
2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8399-1] [55583] DEBUG1 remoteWorkerThread_1: disconnecting from data provider 1
2009 Feb 27 06:31:52 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8400-1] [55583] DEBUG1 remoteWorkerThread_3: connected to data provider 3 on 'dbname=database_name_prod host=remoteslavedb-ifx.bdb.facil1
2009 Feb 27 06:31:52 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8400-2] user=database_name_slony password=[redacted]'
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8401-1] [55583] INFO localListenThread: got restart notification
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [2-1] [1058] DEBUG1 slon: retry requested
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8402-1] [55583] DEBUG1 main: scheduler mainloop returned
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8403-1] [55583] INFO remoteListenThread_3: disconnecting from 'dbname=database_name_prod host=remoteslavedb-ifx.bdb.facil1
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8403-2] user=database_name_slony password=[redacted]'
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8404-1] [55583] INFO remoteListenThread_1: disconnecting from 'dbname=database_name_prod host=masterdb-db.cluster00.facil0
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8404-2] user=database_name_slony password=[redacted]'
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8405-1] [55583] DEBUG1 syncThread: thread done
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8406-1] [55583] DEBUG1 cleanupThread: thread done
2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8407-1] [55583] DEBUG1 remoteWorkerThread_1: thread done
2009 Feb 27 06:32:02 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8408-1] [55583] DEBUG1 remoteListenThread_3: thread done
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8409-1] [55583] DEBUG1 remoteListenThread_1: thread done
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8410-1] [55583] DEBUG1 main: done
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [3-1] [1058] DEBUG1 slon: restart of worker
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][1058] [local2] [err] slon[1058]: [1-1] [1058] CONFIG main: slon version 1.2.15 starting up
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [2-1] [27776] CONFIG main: local node id = 2
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [3-1] [27776] CONFIG main: launching sched_start_mainloop
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [4-1] [27776] CONFIG main: loading current cluster configuration
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [5-1] [27776] CONFIG storeNode: no_id=1 no_comment='Master node'
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [6-1] [27776] CONFIG storeNode: no_id=3 no_comment='remote slave node'
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [7-1] [27776] CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="dbname=database_name_prod host=masterdb-db.cluster00.facil0
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [7-2] user=database_name_slony password=[redacted]" pa_connretry=10
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [8-1] [27776] CONFIG storePath: pa_server=3 pa_client=2 pa_conninfo="dbname=database_name_prod host=remoteslavedb-ifx.bdb.facil1
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [8-2] user=database_name_slony password=[redacted]" pa_connretry=10
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [9-1] [27776] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [10-1] [27776] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=3
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [11-1] [27776] CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=3
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [12-1] [27776] CONFIG storeSet: set_id=1 set_origin=3 set_comment='All tables'
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [warning] slon[27776]: [13-1] [27776] WARN remoteWorker_wakeup: node 3 - no worker thread
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [14-1] [27776] CONFIG storeSubscribe: sub_set=1 sub_provider=3 sub_forward='t'
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [warning] slon[27776]: [15-1] [27776] WARN remoteWorker_wakeup: node 3 - no worker thread
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [16-1] [27776] CONFIG enableSubscription: sub_set=1
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [warning] slon[27776]: [17-1] [27776] WARN remoteWorker_wakeup: node 3 - no worker thread
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [18-1] [27776] CONFIG main: configuration complete - starting threads
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [debug] slon[27776]: [19-1] [27776] DEBUG1 localListenThread: thread starts
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-1] [27776] FATAL localListenThread: "select "_database_name_prod".cleanupNodelock(); insert into
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-2] "_database_name_prod".sl_nodelock values ( 2, 0, "pg_catalog".pg_backend_pid()); " - ERROR: duplicate key value violates
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-3] unique constraint "sl_nodelock-pkey"
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [2-1] [1058] DEBUG1 slon: shutdown requested
2009 Feb 27 06:32:24 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [3-1] [1058] DEBUG1 slon: child termination timeout - kill child
2009 Feb 27 06:32:24 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [4-1] [1058] DEBUG1 slon: done

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/


More information about the Slony1-general mailing list