[Slony1-general] drop node error
Tignor, Tom
Mon Jul 18 04:58:08 PDT 2016
Hi Steve,

	Thanks for looking into this. The context in which this occurs is an effort to move a node. The specifics are somewhat involved, but the key points are that a node is failed over, then dropped, and a new node with the same node ID is created elsewhere (another DB on another host). The recent work makes a best effort to automatically reverse the steps if a failure is encountered: fail over and drop the new node, then recreate the original. There are certainly 'wait for event' commands along the way, but it seems likely that not all nodes are fully caught up before the error occurs.

	If you're trying to reproduce, repeatedly dropping and recreating a node with the same node ID could help. For the specific problem of the race, though, it might be simpler to just 'kill -STOP' the slon daemons on node A, wait for node B to drop its schema, and then resume node A with 'kill -CONT'.

	Tom ☺

On 7/17/16, 2:16 PM, "Steve Singer" <steve at ssinger.info> wrote:

>On 07/12/2016 08:23 AM, Steve Singer wrote:
>> On 07/08/2016 03:27 PM, Tignor, Tom wrote:
>>> Hello slony group,
>>>
>>> I'm testing now with slony1-2.2.4. I have just recently produced an
>>> error which effectively stops slon processing on some node A due to
>>> some node B being dropped. The event reproduces only infrequently.
>>> As some will know, a slon daemon for a given node which becomes aware
>>> its node has been dropped will respond by dropping its cluster schema.
>>> There appears to be a race condition between the node B schema drop
>>> and the (surviving) node A receipt of the disableNode (drop node)
>>> event. If the former occurs before the latter, all the remote worker
>>> threads on node A enter an error state. See the log samples below.
>>> I resolved this the first time by deleting all the recent non-SYNC
>>> events from the sl_event tables, and more recently with a simple
>>> node A slon restart.
>>>
>>> Please advise if there is any ticket I should provide this info to,
>>> or if I should create a new one. Thanks.
>>>
>>
>> The Slony bug tracker is at
>> http://bugs.slony.info/bugzilla/
>>
>> I assume you're saying that when the slon restarts it keeps hitting
>> this error and keeps restarting.
>>
>
>Any more hints on how you reproduce this would be helpful.
>I've been trying to reproduce this with no luck.
>
>At the time you issue the drop node, are all other nodes caught up to
>the dropped node (and the event node) with respect to configuration
>events?
>
>I.e. if you do
>
>sync(id=$NODE_ABOUT_TO_DROP);
>wait for event(wait on=$NODE_ABOUT_TO_DROP, origin=$NODE_ABOUT_TO_DROP,
>confirmed=all);
>
>sync(id=$EVENT_NODE_FOR_DROP);
>wait for event(wait on=$EVENT_NODE_FOR_DROP,
>origin=$EVENT_NODE_FOR_DROP, confirmed=all);
>
>(notice that I am NOT passing any timeout to the wait for).
>
>>
>>> ---- node 1 log ----
>>>
>>> 2016-07-08 18:06:31 UTC [30382] INFO remoteWorkerThread_999999: SYNC 5000000008 done in 0.002 seconds
>>> 2016-07-08 18:06:33 UTC [30382] INFO remoteWorkerThread_999999: SYNC 5000000009 done in 0.002 seconds
>>> 2016-07-08 18:06:33 UTC [30382] INFO remoteWorkerThread_2: SYNC 5000017869 done in 0.002 seconds
>>> 2016-07-08 18:06:33 UTC [30382] INFO remoteWorkerThread_3: SYNC 5000018148 done in 0.004 seconds
>>> 2016-07-08 18:06:45 UTC [30382] CONFIG remoteWorkerThread_2: update provider configuration
>>> 2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_3: "select last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR: schema "_ams_cluster" does not exist
>>> LINE 1: select last_value from "_ams_cluster".sl_log_status
>>> ^
>>> 2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_3: SYNC aborted
>>> 2016-07-08 18:06:45 UTC [30382] CONFIG version for "dbname=ams host=198.18.102.45 user=ams_slony sslmode=verify-ca sslcert=/usr/local/akamai/.ams_certs/complete-ams_slony.crt sslkey=/usr/local/akamai/.ams_certs/ams_slony.private_key sslrootcert=/usr/local/akamai/etc/ssl_ca/canonical_ca_roots.pem" is 90119
>>> 2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_2: "select last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR: schema "_ams_cluster" does not exist
>>> LINE 1: select last_value from "_ams_cluster".sl_log_status
>>> ^
>>> 2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_2: SYNC aborted
>>> 2016-07-08 18:06:45 UTC [30382] ERROR remoteListenThread_999999: "select ev_origin, ev_seqno, ev_timestamp, ev_snapshot, "pg_catalog".txid_snapshot_xmin(ev_snapshot), "pg_catalog".txid_snapshot_xmax(ev_snapshot), ev_type, ev_data1, ev_data2, ev_data3, ev_data4, ev_data5, ev_data6, ev_data7, ev_data8 from "_ams_cluster".sl_event e where (e.ev_origin = '999999' and e.ev_seqno > '5000000009') or (e.ev_origin = '2' and e.ev_seqno > '5000017870') or (e.ev_origin = '3' and e.ev_seqno > '5000018151') order by e.ev_origin, e.ev_seqno limit 40" - ERROR: schema "_ams_cluster" does not exist
>>> LINE 1: ...v_data5, ev_data6, ev_data7, ev_data8 from "_ams_clus...
>>> ^
>>> 2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_3: "start transaction; set enable_seqscan = off; set enable_indexscan = on; " PGRES_FATAL_ERROR ERROR: current transaction is aborted, commands ignored until end of transaction block
>>> 2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_3: SYNC aborted
>>> 2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_2: "start transaction; set enable_seqscan = off; set enable_indexscan = on; " PGRES_FATAL_ERROR ERROR: current transaction is aborted, commands ignored until end of transaction block
>>> 2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_2: SYNC aborted
>>> ----
>>>
>>> ---- node 999999 log ----
>>>
>>> 2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_1: SYNC 5000081216 done in 0.004 seconds
>>> 2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_2: SYNC 5000017870 done in 0.004 seconds
>>> 2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_3: SYNC 5000018150 done in 0.004 seconds
>>> 2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_1: SYNC 5000081217 done in 0.003 seconds
>>> 2016-07-08 18:06:44 UTC [558] WARN remoteWorkerThread_3: got DROP NODE for local node ID
>>> NOTICE: Slony-I: Please drop schema "_ams_cluster"
>>> NOTICE: drop cascades to 171 other objects
>>> DETAIL: drop cascades to table _ams_cluster.sl_node
>>> drop cascades to table _ams_cluster.sl_nodelock
>>> drop cascades to table _ams_cluster.sl_set
>>> drop cascades to table _ams_cluster.sl_setsync
>>> drop cascades to table _ams_cluster.sl_table
>>> ----
>>>
>>> Tom J
>>>
>>> _______________________________________________
>>> Slony1-general mailing list
>>> Slony1-general at lists.slony.info
>>> http://lists.slony.info/mailman/listinfo/slony1-general
>>
>> _______________________________________________
>> Slony1-general mailing list
>> Slony1-general at lists.slony.info
>> http://lists.slony.info/mailman/listinfo/slony1-general
>
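
For readers following along, here is a minimal sketch of the drop-and-recreate-with-the-same-node-ID sequence described above, folding in the sync/wait-for-event hygiene from Steve's quoted suggestion (no timeout passed). The cluster name, node IDs, hosts and conninfo strings are illustrative only, the failover and automatic-reversal steps Tom mentions are omitted, and this is not Tom's actual tooling:

    #!/bin/sh
    # Step 1: with node 3 still on its old host, make sure every node has
    # consumed all outstanding configuration events, then drop it.
    slonik <<_EOF_
    cluster name = ams_cluster;
    node 1 admin conninfo = 'dbname=ams host=host1 user=ams_slony';
    node 2 admin conninfo = 'dbname=ams host=host2 user=ams_slony';
    node 3 admin conninfo = 'dbname=ams host=old-host3 user=ams_slony';

    sync (id = 3);
    wait for event (origin = 3, confirmed = all, wait on = 3);
    sync (id = 1);
    wait for event (origin = 1, confirmed = all, wait on = 1);

    drop node (id = 3, event node = 1);
    wait for event (origin = 1, confirmed = all, wait on = 1);
    _EOF_

    # Step 2: recreate a node with the same ID on the new host and wire up
    # its paths.
    slonik <<_EOF_
    cluster name = ams_cluster;
    node 1 admin conninfo = 'dbname=ams host=host1 user=ams_slony';
    node 3 admin conninfo = 'dbname=ams host=new-host3 user=ams_slony';

    store node (id = 3, comment = 'node 3 rebuilt on new host', event node = 1);
    store path (server = 1, client = 3, conninfo = 'dbname=ams host=host1 user=ams_slony');
    store path (server = 3, client = 1, conninfo = 'dbname=ams host=new-host3 user=ams_slony');
    wait for event (origin = 1, confirmed = all, wait on = 1);
    _EOF_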
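
And a sketch of the kill -STOP / kill -CONT reproduction Tom suggests: freeze node A's slon so it cannot consume the disableNode event, drop node B, wait for node B's slon to drop its local cluster schema, then let node A's slon resume. The pgrep patterns, hosts, node IDs and credentials are again made up, and it assumes node B can still pull the drop event from node A's database while node A's slon is frozen:

    #!/bin/sh
    # 1. Freeze the slon daemon(s) serving node A (node 1 here).
    NODE_A_SLON_PIDS=$(pgrep -f 'slon.*ams_cluster.*host=nodeA-host')
    kill -STOP $NODE_A_SLON_PIDS

    # 2. Drop node B (999999 here), using node A (1) as the event node.
    slonik <<_EOF_
    cluster name = ams_cluster;
    node 1 admin conninfo = 'dbname=ams host=nodeA-host user=ams_slony';
    node 999999 admin conninfo = 'dbname=ams host=nodeB-host user=ams_slony';
    drop node (id = 999999, event node = 1);
    _EOF_

    # 3. Wait until node B's slon has noticed the drop and removed its
    #    "_ams_cluster" schema.
    while psql -h nodeB-host -U ams_slony -d ams -tAc \
          "select 1 from pg_namespace where nspname = '_ams_cluster'" | grep -q 1
    do
        sleep 2
    done

    # 4. Resume node A's slon; its remote workers now process events against
    #    a schema that no longer exists on node B, which is the failure mode
    #    shown in the node 1 log above.
    kill -CONT $NODE_A_SLON_PIDS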