Thu May 1 23:35:07 PDT 2008
> Now with 1.2.12 and 1.2.14rc (have not tested 1.2.13 yet, but since
> it's apparent in 1.2.12 and in 1.2.14rc even with the "patch/possible
> fix", I'm guessing the issue is very much present in 1.2.13 too). This
> is a large issue, as failover and switchover are key elements in this
> application.

Just to add my test report on 1.2.14RC - I agree that it still exhibits
the "move set" problem.

I tested it with a 3-node PostgreSQL 8.3.1 cluster:

    node3 <-- node1 --> node2

There is one replication set, containing all the tables used in a
pgbench installation (the test app). The initial state was: origin on
node1, with node2 and node3 both subscribing directly to node1.

Replication works fine in this initial state:

node1=# insert into accounts values (100000000, 100000, 100000, 'hello sailor');
INSERT 0 1
node1=# \c node2
You are now connected to database "node2".
node2=# select * from accounts where aid = 100000000;
    aid    |  bid   | abalance |    filler
-----------+--------+----------+--------------
 100000000 | 100000 |   100000 | hello sailor
(1 row)

node2=# \c node3
You are now connected to database "node3".
node3=# select * from accounts where aid = 100000000;
    aid    |  bid   | abalance |    filler
-----------+--------+----------+--------------
 100000000 | 100000 |   100000 | hello sailor
(1 row)

node3=# \q

So we are getting data, and it worked with a large pgbench run as well.
The move set, however, causes problems:

[ccd at hpsystem slontest]$ slonik_move_set 1 1 2 | slonik
<stdin>:5: Locking down set 1 on node 1
<stdin>:7: Locked down - moving it
<stdin>:9: Replication set 1 moved from node 1 to 2.  Remember to
<stdin>:10: update your configuration file, if necessary, to note the new location
<stdin>:11: for the set.
[ccd at hpsystem slontest]$ psql -U postgres node1
Password for user postgres:
Welcome to psql 8.3.1, the PostgreSQL interactive terminal.

node1=# select * from _replication.sl_subscribe ;
 sub_set | sub_provider | sub_receiver | sub_forward | sub_active
---------+--------------+--------------+-------------+------------
       1 |            1 |            3 | t           | t
       1 |            2 |            1 | t           | t
(2 rows)

node1=# \c node2
You are now connected to database "node2".
node2=# select * from _replication.sl_subscribe ;
 sub_set | sub_provider | sub_receiver | sub_forward | sub_active
---------+--------------+--------------+-------------+------------
       1 |            1 |            3 | t           | t
       1 |            2 |            1 | t           | t
(2 rows)

node2=# \c node3
You are now connected to database "node3".
node3=# select * from _replication.sl_subscribe ;
 sub_set | sub_provider | sub_receiver | sub_forward | sub_active
---------+--------------+--------------+-------------+------------
       1 |            1 |            2 | t           | t
       1 |            1 |            3 | t           | t
(2 rows)

Node3 never reflects the new state of the cluster. Restarting the slon
daemons was ineffective.
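For what it's worth, a quick way to check whether the MOVE_SET and
ACCEPT_SET events at least reached node3's event queue is to query
sl_event there. This is only a sketch of the sort of query I mean, not
output captured from my session:

    -- run on node3; checks whether the events arrived in its queue
    select ev_origin, ev_seqno, ev_type, ev_timestamp
      from _replication.sl_event
     where ev_type in ('MOVE_SET', 'ACCEPT_SET')
     order by ev_timestamp;

As the slon logs further down show, node3 does receive both events - it
just never finishes processing them.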
Looking at locks, the same symptomatic lock is present (as in all the
other reports of this issue):

node3=# select * from pg_locks;
   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |        mode         | granted
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+---------------------+---------
 virtualxid    |          |          |      |       | 16/121     |               |         |       |          | 16/121             | 3480 | ExclusiveLock       | t
 relation      |    16387 |    20718 |      |       |            |               |         |       |          | 16/121             | 3480 | AccessExclusiveLock | f
 relation      |    16387 |    20663 |      |       |            |               |         |       |          | 21/108             | 3489 | AccessShareLock     | t
 virtualxid    |          |          |      |       | 9/130      |               |         |       |          | 9/130              | 3426 | ExclusiveLock       | t
 relation      |    16386 |    20311 |      |       |            |               |         |       |          | 9/130              | 3426 | RowExclusiveLock    | t
 relation      |    16386 |    20303 |      |       |            |               |         |       |          | 9/130              | 3426 | RowExclusiveLock    | t
 transactionid |          |          |      |       |            |         36277 |         |       |          | 9/130              | 3426 | ExclusiveLock       | t
 virtualxid    |          |          |      |       | 4/396      |               |         |       |          | 4/396              | 3382 | ExclusiveLock       | t
 relation      |    16387 |    10969 |      |       |            |               |         |       |          | 23/134             | 5232 | AccessShareLock     | t
 virtualxid    |          |          |      |       | 21/108     |               |         |       |          | 21/108             | 3489 | ExclusiveLock       | t
 relation      |    16386 |     2614 |      |       |            |               |         |       |          | 9/130              | 3426 | ExclusiveLock       | t
 relation      |    16387 |    20669 |      |       |            |               |         |       |          | 21/108             | 3489 | AccessShareLock     | t
 relation      |    16386 |     2614 |      |       |            |               |         |       |          | 4/396              | 3382 | ExclusiveLock       | f
 virtualxid    |          |          |      |       | 23/134     |               |         |       |          | 23/134             | 5232 | ExclusiveLock       | t
 relation      |    16387 |    20718 |      |       |            |               |         |       |          | 21/108             | 3489 | AccessExclusiveLock | t
(15 rows)

node3=# select * from pg_locks where granted = 'f';
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |        mode         | granted
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+---------------------+---------
 relation |    16387 |    20718 |      |       |            |               |         |       |          | 16/121             | 3480 | AccessExclusiveLock | f
(1 row)

node3=# select relname, relnamespace from pg_class where oid = 20718;
    relname     | relnamespace
----------------+--------------
 sl_config_lock |        20491
(1 row)
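To see both sides of that conflict in one place, something along these
lines should work (an untested sketch; note that on 8.3 the
pg_stat_activity column is procpid):

    -- run on node3; shows who waits for and who holds sl_config_lock
    select l.pid, l.mode, l.granted, a.current_query
      from pg_locks l
      join pg_stat_activity a on a.procpid = l.pid
     where l.relation = '_replication.sl_config_lock'::regclass;

From the pg_locks output above, pid 3480 is waiting for an
AccessExclusiveLock on sl_config_lock that pid 3489 already holds.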
Looking at the logfiles:

Node2's slon daemon logfile contains:
=================================
2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: queue event 1,67 MOVE_SET
2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,66 SYNC
2008-05-02 14:34:22 EST DEBUG2 calc sync size - last time: 1 last length: 7719 ideal: 7 proposed size: 3
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 processing
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: syncing set 1 with 4 table(s) from provider 1
2008-05-02 14:34:22 EST DEBUG2 ssy_action_list length: 0
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: current local log_status is 0
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1_1: current remote log_status = 0
2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.001 seconds delay for first row
2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.001 seconds until close cursor
2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000011000
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 done in 0.003 seconds
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,67 MOVE_SET
2008-05-02 14:34:22 EST CONFIG moveSet: set_id=1 old_origin=1 new_origin=2
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=2 (0 threads + worker signaled)
2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=3
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=3
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=1
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
2008-05-02 14:34:22 EST DEBUG1 remoteWorkerThread_1: helper thread for provider 1 terminated
2008-05-02 14:34:22 EST DEBUG1 remoteWorkerThread_1: disconnecting from data provider 1
2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_3: start listening for event origin 1
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_3: forward confirm 1,66 received by 3
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_3: forward confirm 2,34 received by 3
2008-05-02 14:34:23 EST DEBUG2 remoteWorkerThread_1: forward confirm 2,34 received by 1
2008-05-02 14:34:23 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 36
2008-05-02 14:34:25 EST DEBUG2 localListenThread: Received event 2,34 SYNC
2008-05-02 14:34:25 EST DEBUG2 localListenThread: Received event 2,35 ACCEPT_SET
2008-05-02 14:34:25 EST DEBUG2 localListenThread: ACCEPT_SET
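Given the storeListen reconfiguration shown above, it may also be worth
comparing the listen paths recorded on each node. Again, just a sketch
of the kind of query I mean - I did not capture this output:

    -- run on each node and compare; lists the configured listen paths
    select li_origin, li_provider, li_receiver
      from _replication.sl_listen
     order by li_origin, li_receiver;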
Node3's slon daemon logfile contains:
=================================
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_2: Received event 2,35 ACCEPT_SET
2008-05-02 14:34:22 EST DEBUG2 start processing ACCEPT_SET
2008-05-02 14:34:22 EST DEBUG2 ACCEPT: set=1
2008-05-02 14:34:22 EST DEBUG2 ACCEPT: old origin=1
2008-05-02 14:34:22 EST DEBUG2 ACCEPT: new origin=2
2008-05-02 14:34:22 EST DEBUG2 ACCEPT: move set seq=67
2008-05-02 14:34:22 EST DEBUG2 got parms ACCEPT_SET
2008-05-02 14:34:22 EST DEBUG2 ACCEPT_SET - node not origin
2008-05-02 14:34:22 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: queue event 1,66 SYNC
2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: queue event 1,67 MOVE_SET
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,66 SYNC
2008-05-02 14:34:22 EST DEBUG2 calc sync size - last time: 1 last length: 111 ideal: 540 proposed size: 3
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 processing
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: syncing set 1 with 4 table(s) from provider 1
2008-05-02 14:34:22 EST DEBUG2 ssy_action_list length: 0
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: current local log_status is 0
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1_1: current remote log_status = 0
2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.000 seconds delay for first row
2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.000 seconds until close cursor
2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000011000
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 done in 0.002 seconds
2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,67 MOVE_SET
2008-05-02 14:34:23 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 35
2008-05-02 14:34:24 EST DEBUG2 localListenThread: Received event 3,35 SYNC
2008-05-02 14:34:26 EST DEBUG2 remoteListenThread_2: queue event 2,36 SYNC
2008-05-02 14:34:30 EST DEBUG2 remoteListenThread_1: queue event 1,68 SYNC
2008-05-02 14:34:32 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-02 14:34:33 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 36
2008-05-02 14:34:34 EST DEBUG2 localListenThread: Received event 3,36 SYNC
2008-05-02 14:34:34 EST DEBUG2 remoteListenThread_1: queue event 1,69 SYNC
2008-05-02 14:34:34 EST DEBUG2 remoteListenThread_1: queue event 2,37 SYNC
2008-05-02 14:34:38 EST DEBUG2 remoteListenThread_1: LISTEN
2008-05-02 14:34:42 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-02 14:34:44 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 37
2008-05-02 14:34:44 EST DEBUG2 remoteListenThread_1: queue event 1,70 SYNC
2008-05-02 14:34:44 EST DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-02 14:34:44 EST DEBUG2 remoteListenThread_2: queue event 2,38 SYNC
2008-05-02 14:34:44 EST DEBUG2 localListenThread: Received event 3,37 SYNC
2008-05-02 14:34:48 EST DEBUG2 remoteListenThread_1: LISTEN
2008-05-02 14:34:52 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep

The "ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep"
messages continue indefinitely.

> Sometimes in 1.2.12 and 1.2.14rc the failover works,

Yes, I've also experienced this - _occasionally_ the first 'move set'
will work, but moving again always kills it.

> I'm more than happy to work thru this as I really want to push out
> 8.3.1 and would love to have a functioning 1.2.14 slon release, but
> something bad happened between 1.2.11 and current.. Either something
> new that I have not added to my setup scripts or it's the code.
>
> I'll work with someone on this!!!

I'm going to investigate this myself in more detail when I get some
time - if I come up with anything useful, I'll post it here.

I know this post is basically just yet another description of the
issue, but I thought it might be valuable nonetheless. Does anyone see
anything wrong in my setup that might be causing this, rather than the
bug?

Thanks

Charles Duffy