Thu May 1 23:49:55 PDT 2008
- Previous message: [Slony1-general] 1.2.14rc still does not appear to handle switchover cleanly
- Next message: [Slony1-general] 1.2.14rc still does not appear to handle switchover cleanly
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi

> > Now with 1.2.12 and 1.2.14rc (I have not tested 1.2.13 yet, but since
> > it's apparent in 1.2.12 and in 1.2.14rc even with the "patch/possible
> > fix", I'm guessing the issue is very much in 1.2.13 as well) there is a
> > large issue, as failover and switchover are key elements in this
> > application.
>
> Just to add my test report on 1.2.14RC - I agree that it still exhibits
> the "move set" problem.

Did you apply the patch from the following page?
http://lists.slony.info/pipermail/slony1-general/2008-March/007690.html

> I tested it out with a 3-node PostgreSQL 8.3.1 cluster:
>
>     node3 <-- node1 --> node2
>
> There's one replication set. It contains all the tables used in a pgbench
> installation (the test app). The initial state was: origin on node1, with
> node2 and node3 both subscribing directly to node1.
>
> It works fine in this initial state:
>
> node1=# insert into accounts values (100000000, 100000, 100000, 'hello sailor');
> INSERT 0 1
> node1=# \c node2
> You are now connected to database "node2".
> node2=# select * from accounts where aid = 100000000;
>     aid    |  bid   | abalance |    filler
> -----------+--------+----------+--------------
>  100000000 | 100000 |   100000 | hello sailor
> (1 row)
>
> node2=# \c node3
> You are now connected to database "node3".
> node3=# select * from accounts where aid = 100000000;
>     aid    |  bid   | abalance |    filler
> -----------+--------+----------+--------------
>  100000000 | 100000 |   100000 | hello sailor
> (1 row)
>
> node3=# \q
>
> So, we're getting data. It worked with a large pgbench run as well.
>
> The move set, however, causes problems:
>
> [ccd at hpsystem slontest]$ slonik_move_set 1 1 2 | slonik
> <stdin>:5: Locking down set 1 on node 1
> <stdin>:7: Locked down - moving it
> <stdin>:9: Replication set 1 moved from node 1 to 2.  Remember to
> <stdin>:10: update your configuration file, if necessary, to note the new location
> <stdin>:11: for the set.
>
> [ccd at hpsystem slontest]$ psql -U postgres node1
> Password for user postgres:
> Welcome to psql 8.3.1, the PostgreSQL interactive terminal.
>
> node1=# select * from _replication.sl_subscribe ;
>  sub_set | sub_provider | sub_receiver | sub_forward | sub_active
> ---------+--------------+--------------+-------------+------------
>        1 |            1 |            3 | t           | t
>        1 |            2 |            1 | t           | t
> (2 rows)
>
> node1=# \c node2
> You are now connected to database "node2".
> node2=# select * from _replication.sl_subscribe ;
>  sub_set | sub_provider | sub_receiver | sub_forward | sub_active
> ---------+--------------+--------------+-------------+------------
>        1 |            1 |            3 | t           | t
>        1 |            2 |            1 | t           | t
> (2 rows)
>
> node2=# \c node3
> You are now connected to database "node3".
> node3=# select * from _replication.sl_subscribe ;
>  sub_set | sub_provider | sub_receiver | sub_forward | sub_active
> ---------+--------------+--------------+-------------+------------
>        1 |            1 |            2 | t           | t
>        1 |            1 |            3 | t           | t
>
> Node3 never reflects the new state of the cluster. Restarting the slon
> daemons was ineffective.
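[Editorial note: when an ungranted lock blocks Slony's configuration changes, as in the report that follows, the backend already holding the conflicting lock can be identified with a pg_locks self-join. This is a generic diagnostic sketch for PostgreSQL 8.3 (which has no `pg_blocking_pids()`), not a query from the original report, and it only covers relation-level locks:]

```sql
-- Hypothetical diagnostic query: pair each ungranted relation lock with
-- the backend(s) holding a granted lock on the same relation.
SELECT blocked.pid   AS blocked_pid,
       blocked.mode  AS wanted_mode,
       blocking.pid  AS blocking_pid,
       blocking.mode AS held_mode,
       c.relname
  FROM pg_locks blocked
  JOIN pg_locks blocking
    ON blocking.locktype = blocked.locktype
   AND blocking.database = blocked.database
   AND blocking.relation = blocked.relation
   AND blocking.granted
   AND blocking.pid <> blocked.pid
  JOIN pg_class c ON c.oid = blocked.relation
 WHERE NOT blocked.granted;
```

[Against the pg_locks listing in this thread, this would pair pid 3480, waiting for AccessExclusiveLock on sl_config_lock, with pid 3489, which already holds AccessExclusiveLock on that relation.]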
> Looking at locks, the same symptomatic lock is present (as in all
> other reports of this issue):
>
> node3=# select * from pg_locks;
>    locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |        mode         | granted
> ---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+---------------------+---------
>  virtualxid    |          |          |      |       | 16/121     |               |         |       |          | 16/121             | 3480 | ExclusiveLock       | t
>  relation      |    16387 |    20718 |      |       |            |               |         |       |          | 16/121             | 3480 | AccessExclusiveLock | f
>  relation      |    16387 |    20663 |      |       |            |               |         |       |          | 21/108             | 3489 | AccessShareLock     | t
>  virtualxid    |          |          |      |       | 9/130      |               |         |       |          | 9/130              | 3426 | ExclusiveLock       | t
>  relation      |    16386 |    20311 |      |       |            |               |         |       |          | 9/130              | 3426 | RowExclusiveLock    | t
>  relation      |    16386 |    20303 |      |       |            |               |         |       |          | 9/130              | 3426 | RowExclusiveLock    | t
>  transactionid |          |          |      |       |            |         36277 |         |       |          | 9/130              | 3426 | ExclusiveLock       | t
>  virtualxid    |          |          |      |       | 4/396      |               |         |       |          | 4/396              | 3382 | ExclusiveLock       | t
>  relation      |    16387 |    10969 |      |       |            |               |         |       |          | 23/134             | 5232 | AccessShareLock     | t
>  virtualxid    |          |          |      |       | 21/108     |               |         |       |          | 21/108             | 3489 | ExclusiveLock       | t
>  relation      |    16386 |     2614 |      |       |            |               |         |       |          | 9/130              | 3426 | ExclusiveLock       | t
>  relation      |    16387 |    20669 |      |       |            |               |         |       |          | 21/108             | 3489 | AccessShareLock     | t
>  relation      |    16386 |     2614 |      |       |            |               |         |       |          | 4/396              | 3382 | ExclusiveLock       | f
>  virtualxid    |          |          |      |       | 23/134     |               |         |       |          | 23/134             | 5232 | ExclusiveLock       | t
>  relation      |    16387 |    20718 |      |       |            |               |         |       |          | 21/108             | 3489 | AccessExclusiveLock | t
> (15 rows)
>
> node3=# select * from pg_locks where granted = 'f';
>  locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |        mode         | granted
> ----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+---------------------+---------
>  relation |    16387 |    20718 |      |       |            |               |         |       |          | 16/121             | 3480 | AccessExclusiveLock | f
> (1 row)
>
> node3=# select relname,relnamespace from pg_class where oid = 20718;
>     relname     | relnamespace
> ----------------+--------------
>  sl_config_lock |        20491
> (1 row)
>
> Looking at logfiles:
>
> Node2's slon daemon logfile contains:
> =================================
> 2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: queue event 1,67 MOVE_SET
> 2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: UNLISTEN
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,66 SYNC
> 2008-05-02 14:34:22 EST DEBUG2 calc sync size - last time: 1 last length: 7719 ideal: 7 proposed size: 3
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 processing
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: syncing set 1 with 4 table(s) from provider 1
> 2008-05-02 14:34:22 EST DEBUG2 ssy_action_list length: 0
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: current local log_status is 0
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1_1: current remote log_status = 0
> 2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.001 seconds delay for first row
> 2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.001 seconds until close cursor
> 2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000011000
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 done in 0.003 seconds
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,67 MOVE_SET
> 2008-05-02 14:34:22 EST CONFIG moveSet: set_id=1 old_origin=1 new_origin=2
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=2 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=3
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=3
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=1
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST DEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
> 2008-05-02 14:34:22 EST DEBUG1 remoteWorkerThread_1: helper thread for provider 1 terminated
> 2008-05-02 14:34:22 EST DEBUG1 remoteWorkerThread_1: disconnecting from data provider 1
> 2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_3: start listening for event origin 1
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_3: forward confirm 1,66 received by 3
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_3: forward confirm 2,34 received by 3
> 2008-05-02 14:34:23 EST DEBUG2 remoteWorkerThread_1: forward confirm 2,34 received by 1
> 2008-05-02 14:34:23 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 36
> 2008-05-02 14:34:25 EST DEBUG2 localListenThread: Received event 2,34 SYNC
> 2008-05-02 14:34:25 EST DEBUG2 localListenThread: Received event 2,35 ACCEPT_SET
> 2008-05-02 14:34:25 EST DEBUG2 localListenThread: ACCEPT_SET
>
> Node3's slon daemon logfile contains:
> =================================
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_2: Received event 2,35 ACCEPT_SET
> 2008-05-02 14:34:22 EST DEBUG2 start processing ACCEPT_SET
> 2008-05-02 14:34:22 EST DEBUG2 ACCEPT: set=1
> 2008-05-02 14:34:22 EST DEBUG2 ACCEPT: old origin=1
> 2008-05-02 14:34:22 EST DEBUG2 ACCEPT: new origin=2
> 2008-05-02 14:34:22 EST DEBUG2 ACCEPT: move set seq=67
> 2008-05-02 14:34:22 EST DEBUG2 got parms ACCEPT_SET
> 2008-05-02 14:34:22 EST DEBUG2 ACCEPT_SET - node not origin
> 2008-05-02 14:34:22 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
> 2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: queue event 1,66 SYNC
> 2008-05-02 14:34:22 EST DEBUG2 remoteListenThread_1: queue event 1,67 MOVE_SET
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,66 SYNC
> 2008-05-02 14:34:22 EST DEBUG2 calc sync size - last time: 1 last length: 111 ideal: 540 proposed size: 3
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 processing
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: syncing set 1 with 4 table(s) from provider 1
> 2008-05-02 14:34:22 EST DEBUG2 ssy_action_list length: 0
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: current local log_status is 0
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1_1: current remote log_status = 0
> 2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.000 seconds delay for first row
> 2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: 0.000 seconds until close cursor
> 2008-05-02 14:34:22 EST DEBUG2 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000011000
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: SYNC 66 done in 0.002 seconds
> 2008-05-02 14:34:22 EST DEBUG2 remoteWorkerThread_1: Received event 1,67 MOVE_SET
> 2008-05-02 14:34:23 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 35
> 2008-05-02 14:34:24 EST DEBUG2 localListenThread: Received event 3,35 SYNC
> 2008-05-02 14:34:26 EST DEBUG2 remoteListenThread_2: queue event 2,36 SYNC
> 2008-05-02 14:34:30 EST DEBUG2 remoteListenThread_1: queue event 1,68 SYNC
> 2008-05-02 14:34:32 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
> 2008-05-02 14:34:33 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 36
> 2008-05-02 14:34:34 EST DEBUG2 localListenThread: Received event 3,36 SYNC
> 2008-05-02 14:34:34 EST DEBUG2 remoteListenThread_1: queue event 1,69 SYNC
> 2008-05-02 14:34:34 EST DEBUG2 remoteListenThread_1: queue event 2,37 SYNC
> 2008-05-02 14:34:38 EST DEBUG2 remoteListenThread_1: LISTEN
> 2008-05-02 14:34:42 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
> 2008-05-02 14:34:44 EST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 37
> 2008-05-02 14:34:44 EST DEBUG2 remoteListenThread_1: queue event 1,70 SYNC
> 2008-05-02 14:34:44 EST DEBUG2 remoteListenThread_1: UNLISTEN
> 2008-05-02 14:34:44 EST DEBUG2 remoteListenThread_2: queue event 2,38 SYNC
> 2008-05-02 14:34:44 EST DEBUG2 localListenThread: Received event 3,37 SYNC
> 2008-05-02 14:34:48 EST DEBUG2 remoteListenThread_1: LISTEN
> 2008-05-02 14:34:52 EST DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
>
> The "ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep"
> messages continue indefinitely.
>
> > Sometimes in 1.2.12 and 1.2.14rc the failover works,
>
> Yes, I've also experienced this - _occasionally_ the first 'move set'
> will work, but moving again always kills it.
>
> > I'm more than happy to work thru this as I really want to push out
> > 8.3.1 and would love to have a functioning 1.2.14 slon release, but
> > something bad happened between 1.2.11 and current. Either something
> > new that I have not added to my setup scripts or it's the code.
> >
> > I'll work with someone on this!!!
>
> I'm going to investigate this myself in more detail when I get some
> time - if I come up with anything useful, I'll be posting...
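[Editorial note: the node3 log above shows the core of the hang. The ACCEPT_SET handler refuses to proceed until the corresponding MOVE_SET has actually been applied locally, and on node3 the MOVE_SET event is received but evidently never applied (it sits behind the ungranted sl_config_lock lock shown earlier), so the handler sleeps forever. The ordering dependency can be sketched as follows; this is illustrative pseudologic, not the actual slon source:]

```python
# Sketch of slon's event-ordering constraint: an ACCEPT_SET for a set
# whose MOVE_SET has not yet been applied is deferred ("sleep"); if the
# MOVE_SET is never applied, the worker loops on that sleep forever.

def process(queue, max_retries=5):
    """Drain an event queue; ACCEPT_SET is deferred until MOVE_SET is seen."""
    seen_move_set = False
    retries = 0
    log = []
    pending = list(queue)
    while pending:
        ev = pending.pop(0)
        if ev == "MOVE_SET":
            seen_move_set = True
            log.append("MOVE_SET applied")
        elif ev == "ACCEPT_SET" and not seen_move_set:
            retries += 1
            log.append("ACCEPT_SET - MOVE_SET not received yet - sleep")
            if retries >= max_retries:
                log.append("giving up (stuck)")  # real slon retries forever
                break
            pending.append(ev)  # retry later, like slon's sleep loop
        else:
            log.append(f"{ev} done")
    return log

# Healthy ordering: MOVE_SET is applied before ACCEPT_SET is handled.
ok = process(["SYNC", "MOVE_SET", "ACCEPT_SET"])
# Broken ordering from the report: MOVE_SET never gets applied, so
# ACCEPT_SET keeps sleeping until the retry cap.
stuck = process(["ACCEPT_SET", "SYNC", "SYNC"])
```

[The `max_retries` cap exists only so the sketch terminates; the logs in this thread suggest the real daemon keeps sleeping and re-checking indefinitely.]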
> I know this post is basically just yet another description of the
> issue, but I thought it might be valuable nonetheless. Anyone see
> anything wrong in my setup that might be causing this, instead of the
> bug?
>
> Thanks
>
> Charles Duffy
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at lists.slony.info
> http://lists.slony.info/mailman/listinfo/slony1-general

--
SRA OSS, Inc. Japan
Yoshiharu Mori <y-mori at sraoss.co.jp>
http://www.sraoss.co.jp/