Mon Oct 10 18:28:31 PDT 2005
Yes, you're right. The backup node's (node 3) sl_setsync has a reference
to node 1:

 ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid |  ssy_xip  | ssy_action_list
-----------+------------+-----------+------------+------------+-----------+-----------------
         1 |          1 |       326 |    2774085 |    2774088 | '2774085' |
(1 row)

Node 2's sl_setsync has this:

 ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
-----------+------------+-----------+------------+------------+---------+-----------------
         1 |          3 |       124 |    2595423 |    2595424 |         |
(1 row)
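For anyone reproducing this: the rows above come straight out of each
node's sl_setsync table. A minimal way to compare the nodes, assuming a
cluster schema named "_whatever" as in the dropNode error quoted further
down; substitute your own cluster name:

    -- Run against each node's database. After a successful failover,
    -- ssy_origin should point at the backup node, not the failed node 1.
    SELECT ssy_setid, ssy_origin, ssy_seqno
    FROM "_whatever".sl_setsync;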
On 10/4/05, elein <elein at varlena.com> wrote:
>
> Yes, it should. But it doesn't. I don't believe any message is ever
> sent to the 3rd node. This is the same in my example. See also the
> sl_setsync table. It has a reference to node 1 (or 10).
>
> On Tue, Oct 04, 2005 at 06:17:06PM -0400, Fiel Cabral wrote:
> > The sl_event table on Node 2 contains a FAILOVER_SET event but node 3
> > (the backup node specified in the failover command) does not. Should
> > the backup node's sl_event table contain the FAILOVER_SET?
> >
> > sl_event on node 2 contains a FAILOVER_SET:
> >
> >         ev_timestamp        | ev_origin | ev_seqno |       ev_type
> > ----------------------------+-----------+----------+---------------------
> >  2005-10-04 17:49:10.487603 |         2 |        1 | STORE_PATH
> >  2005-10-04 17:49:10.70457  |         2 |        2 | STORE_PATH
> >  2005-10-04 17:49:10.712416 |         2 |        3 | STORE_LISTEN
> >  2005-10-04 17:49:10.77891  |         2 |        4 | STORE_LISTEN
> >  2005-10-04 17:49:38.146642 |         2 |        5 | SUBSCRIBE_SET
> >  2005-10-04 17:49:05.608095 |         1 |      306 | STORE_NODE
> >  2005-10-04 17:49:05.608095 |         1 |      307 | ENABLE_NODE
> >  2005-10-04 17:49:08.029042 |         1 |      308 | STORE_NODE
> >  2005-10-04 17:49:08.029042 |         1 |      309 | ENABLE_NODE
> >  2005-10-04 17:49:10.641208 |         1 |      310 | STORE_PATH
> >  2005-10-04 17:49:10.679501 |         1 |      311 | STORE_PATH
> >  2005-10-04 17:49:10.722549 |         1 |      312 | STORE_LISTEN
> >  2005-10-04 17:49:10.751999 |         1 |      313 | STORE_LISTEN
> >  2005-10-04 17:55:02.413185 |         2 |        6 | SYNC
> >  2005-10-04 17:49:42.44082  |         1 |      314 | ENABLE_SUBSCRIPTION
> >  2005-10-04 17:49:10.60801  |         3 |        1 | STORE_PATH
> >  2005-10-04 17:49:42.769833 |         1 |      315 | ENABLE_SUBSCRIPTION
> >  2005-10-04 17:49:10.678128 |         3 |        2 | STORE_PATH
> >  2005-10-04 17:49:10.713706 |         3 |        3 | STORE_LISTEN
> >  2005-10-04 17:49:10.743235 |         3 |        4 | STORE_LISTEN
> >  2005-10-04 17:49:38.417454 |         3 |        5 | SUBSCRIBE_SET
> >  2005-10-04 17:49:52.680621 |         1 |      316 | SYNC
> >  2005-10-04 17:50:53.010532 |         1 |      317 | SYNC
> >  2005-10-04 17:51:53.112317 |         1 |      318 | SYNC
> >  2005-10-04 17:52:53.146222 |         1 |      319 | SYNC
> >  2005-10-04 17:53:53.192119 |         1 |      320 | SYNC
> >  2005-10-04 17:54:53.602106 |         1 |      321 | SYNC
> >  2005-10-04 17:55:53.710807 |         1 |      322 | SYNC
> >  2005-10-04 17:56:02.893106 |         2 |        7 | SYNC
> >  2005-10-04 17:56:42.786823 |         3 |        6 | SYNC
> >  2005-10-04 17:56:53.833985 |         1 |      323 | SYNC
> >  2005-10-04 17:57:03.007883 |         2 |        8 | SYNC
> >  2005-10-04 17:57:43.692981 |         3 |        7 | SYNC
> >  2005-10-04 17:57:53.902912 |         1 |      324 | SYNC
> >  2005-10-04 17:58:03.062867 |         2 |        9 | SYNC
> >  2005-10-04 17:58:43.736478 |         3 |        8 | SYNC
> >  2005-10-04 17:58:53.953325 |         1 |      325 | SYNC
> >  2005-10-04 17:59:03.112996 |         2 |       10 | SYNC
> >  2005-10-04 17:59:43.77303  |         3 |        9 | SYNC
> >  2005-10-04 17:59:54.095892 |         1 |      326 | SYNC
> >  2005-10-04 18:00:03.155204 |         2 |       11 | SYNC
> >  2005-10-04 18:00:43.810793 |         3 |       10 | SYNC
> >  2005-10-04 18:01:03.196571 |         2 |       12 | SYNC
> >  2005-10-04 18:01:43.865925 |         3 |       11 | SYNC
> >  2005-10-04 18:02:03.216029 |         2 |       13 | SYNC
> >  2005-10-04 18:02:43.905505 |         3 |       12 | SYNC
> >  2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
> >  2005-10-04 18:03:38.947704 |         1 |      327 | FAILOVER_SET
> >  2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
> >  2005-10-04 18:03:49.921361 |         2 |       15 | SYNC
> >  2005-10-04 18:04:48.875801 |         3 |       14 | SYNC
> >  2005-10-04 18:04:49.970829 |         2 |       16 | SYNC
> >  2005-10-04 18:05:48.92941  |         3 |       15 | SYNC
> >  2005-10-04 18:05:49.985511 |         2 |       17 | SYNC
> >  2005-10-04 18:06:48.963277 |         3 |       16 | SYNC
> >  2005-10-04 18:06:49.998737 |         2 |       18 | SYNC
> >  2005-10-04 18:07:49.033346 |         3 |       17 | SYNC
> >  2005-10-04 18:07:50.028334 |         2 |       19 | SYNC
> >  2005-10-04 18:08:49.051861 |         3 |       18 | SYNC
> >  2005-10-04 18:08:50.056542 |         2 |       20 | SYNC
> >  2005-10-04 18:09:49.075309 |         3 |       19 | SYNC
> >  2005-10-04 18:09:50.093277 |         2 |       21 | SYNC
> > (62 rows)
> >
> > sl_event on node 3 (backup node) does not have the FAILOVER_SET:
> >
> >         ev_timestamp        | ev_origin | ev_seqno |       ev_type
> > ----------------------------+-----------+----------+---------------------
> >  2005-10-04 17:49:10.60801  |         3 |        1 | STORE_PATH
> >  2005-10-04 17:49:10.678128 |         3 |        2 | STORE_PATH
> >  2005-10-04 17:49:10.713706 |         3 |        3 | STORE_LISTEN
> >  2005-10-04 17:49:10.743235 |         3 |        4 | STORE_LISTEN
> >  2005-10-04 17:49:38.417454 |         3 |        5 | SUBSCRIBE_SET
> >  2005-10-04 17:49:10.487603 |         2 |        1 | STORE_PATH
> >  2005-10-04 17:49:08.029042 |         1 |      308 | STORE_NODE
> >  2005-10-04 17:49:10.70457  |         2 |        2 | STORE_PATH
> >  2005-10-04 17:49:08.029042 |         1 |      309 | ENABLE_NODE
> >  2005-10-04 17:49:10.712416 |         2 |        3 | STORE_LISTEN
> >  2005-10-04 17:49:10.641208 |         1 |      310 | STORE_PATH
> >  2005-10-04 17:49:10.77891  |         2 |        4 | STORE_LISTEN
> >  2005-10-04 17:49:10.679501 |         1 |      311 | STORE_PATH
> >  2005-10-04 17:49:38.146642 |         2 |        5 | SUBSCRIBE_SET
> >  2005-10-04 17:49:10.722549 |         1 |      312 | STORE_LISTEN
> >  2005-10-04 17:55:02.413185 |         2 |        6 | SYNC
> >  2005-10-04 17:56:02.893106 |         2 |        7 | SYNC
> >  2005-10-04 17:49:10.751999 |         1 |      313 | STORE_LISTEN
> >  2005-10-04 17:49:42.44082  |         1 |      314 | ENABLE_SUBSCRIPTION
> >  2005-10-04 17:56:42.786823 |         3 |        6 | SYNC
> >  2005-10-04 17:57:03.007883 |         2 |        8 | SYNC
> >  2005-10-04 17:49:42.769833 |         1 |      315 | ENABLE_SUBSCRIPTION
> >  2005-10-04 17:49:52.680621 |         1 |      316 | SYNC
> >  2005-10-04 17:50:53.010532 |         1 |      317 | SYNC
> >  2005-10-04 17:51:53.112317 |         1 |      318 | SYNC
> >  2005-10-04 17:52:53.146222 |         1 |      319 | SYNC
> >  2005-10-04 17:53:53.192119 |         1 |      320 | SYNC
> >  2005-10-04 17:54:53.602106 |         1 |      321 | SYNC
> >  2005-10-04 17:55:53.710807 |         1 |      322 | SYNC
> >  2005-10-04 17:56:53.833985 |         1 |      323 | SYNC
> >  2005-10-04 17:57:43.692981 |         3 |        7 | SYNC
> >  2005-10-04 17:57:53.902912 |         1 |      324 | SYNC
> >  2005-10-04 17:58:03.062867 |         2 |        9 | SYNC
> >  2005-10-04 17:58:43.736478 |         3 |        8 | SYNC
> >  2005-10-04 17:58:53.953325 |         1 |      325 | SYNC
> >  2005-10-04 17:59:03.112996 |         2 |       10 | SYNC
> >  2005-10-04 17:59:43.77303  |         3 |        9 | SYNC
> >  2005-10-04 17:59:54.095892 |         1 |      326 | SYNC
> >  2005-10-04 18:00:03.155204 |         2 |       11 | SYNC
> >  2005-10-04 18:00:43.810793 |         3 |       10 | SYNC
> >  2005-10-04 18:01:03.196571 |         2 |       12 | SYNC
> >  2005-10-04 18:01:43.865925 |         3 |       11 | SYNC
> >  2005-10-04 18:02:03.216029 |         2 |       13 | SYNC
> >  2005-10-04 18:02:43.905505 |         3 |       12 | SYNC
> >  2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
> >  2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
> >  2005-10-04 18:03:49.921361 |         2 |       15 | SYNC
> >  2005-10-04 18:04:48.875801 |         3 |       14 | SYNC
> >  2005-10-04 18:04:49.970829 |         2 |       16 | SYNC
> >  2005-10-04 18:05:48.92941  |         3 |       15 | SYNC
> >  2005-10-04 18:05:49.985511 |         2 |       17 | SYNC
> >  2005-10-04 18:06:48.963277 |         3 |       16 | SYNC
> >  2005-10-04 18:06:49.998737 |         2 |       18 | SYNC
> >  2005-10-04 18:07:49.033346 |         3 |       17 | SYNC
> >  2005-10-04 18:07:50.028334 |         2 |       19 | SYNC
> >  2005-10-04 18:08:49.051861 |         3 |       18 | SYNC
> >  2005-10-04 18:08:50.056542 |         2 |       20 | SYNC
> >  2005-10-04 18:09:49.075309 |         3 |       19 | SYNC
> >  2005-10-04 18:09:50.093277 |         2 |       21 | SYNC
> >  2005-10-04 18:10:49.100012 |         3 |       20 | SYNC
> >  2005-10-04 18:10:50.117138 |         2 |       22 | SYNC
> > (61 rows)
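For reference, listings like the two above can be produced with something
along these lines (again assuming the "_whatever" cluster schema):

    -- Full event log, as in the listings above.
    SELECT ev_timestamp, ev_origin, ev_seqno, ev_type
    FROM "_whatever".sl_event;

    -- Quick check: did the FAILOVER_SET ever reach this node?
    SELECT ev_origin, ev_seqno, ev_timestamp
    FROM "_whatever".sl_event
    WHERE ev_type = 'FAILOVER_SET';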
> > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> >
> > The problem persists after the node IDs were changed from [1, 2, 3]
> > to [10, 20, 30].
> >
> > Inside gdb, the failedNode2 query did not return an error (the
> > function return value was 0).
> >
> > Node 2 was able to move the set (set_origin = node 3).
> > Node 3 is stuck with set_origin = node 1.
> >
> > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> >
> > Thanks Elein. I'll run gdb and step through slonik_failed_node to
> > (maybe) see if failedNode2 is failing.
> >
> > On 10/4/05, elein <elein at varlena.com> wrote:
> >
> > Fiel,
> >
> > In my own tests, with nodes 10->20->30, failover from 10 to 20 failed
> > because node 30 was unusable and had to be recreated from scratch.
> > This is a serious bug in my book.
> >
> > In one case the problem seemed to be dropping the first node
> > "too soon". I have not tested that case, so I don't know that
> > this was the problem.
> >
> > What I have verified is that the third node never received any
> > message regarding the failover and did not change its information
> > to get its table set from the new origin, 20.
> >
> > Also, try not to use nodes 1, 2, 3. Node 1 has some special meaning
> > in some cases that you will want to avoid.
> >
> > We are with you, not ignoring you.
> >
> > --elein
> >
> > On Tue, Oct 04, 2005 at 11:13:19AM -0400, Fiel Cabral wrote:
> > > Right after running the failover command I issue the DROP NODE
> > > command to drop node 1. slonik prints an error message and exits
> > > with return value 12:
> > >
> > >     sys:17: TRY: drop node
> > >     sys:19: PGRES_FATAL_ERROR select "_whatever".dropNode(1);  -
> > >       ERROR:  Slony-I: Node 1 is still origin of one or more sets
> > >
> > > Something should have changed the origin to node 3 but it isn't
> > > happening.
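That dropNode error is easy to confirm by hand: on the node where DROP
NODE fails, list the sets that still claim the failed node as their
origin (same "_whatever" placeholder; sl_set is where the set origin is
recorded, if I read the schema right):

    -- Any rows here mean dropNode(1) will keep refusing. After a
    -- complete failover, set_origin should be the backup node instead.
    SELECT set_id, set_origin
    FROM "_whatever".sl_set
    WHERE set_origin = 1;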
> > > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> > >
> > > I have 3 nodes. Nodes 2 and 3 are subscribers of node 1 and I'm
> > > trying to failover from node 1 to node 3. The failover command
> > > succeeds but the database of node 3 is still read-only and the
> > > origin is still node 1. I don't have the same problem when doing
> > > failover with only two nodes because the set is moved immediately
> > > by failedNode.
> > >
> > > failedNode (in the code below) is able to set the provider
> > > successfully.
> > >
> > > Some code elsewhere is actually moving the replication set. Where
> > > is that code? Is it in slon or slonik or in the sql scripts?
> > >
> > > How do I find out that slon caught the signal and is doing the
> > > right thing in response to the signal?
> > >
> > >     raise notice ''failedNode: set % has other direct receivers -
> > >             change providers only'', v_row.set_id;
> > >     -- ----
> > >     -- Backup node is not the only direct subscriber. This
> > >     -- means that at this moment, we redirect all direct
> > >     -- subscribers to receive from the backup node, and the
> > >     -- backup node itself to receive from another one.
> > >     -- The admin utility will wait for the slon engine to
> > >     -- restart and then call failedNode2() on the node with
> > >     -- the highest SYNC and redirect this to it on
> > >     -- backup node later.
> > >     -- ----
> > >     ... etc ...
> > >
> > >     -- ----
> > >     -- Make sure the node daemon will restart
> > >     -- ----
> > >     notify "_@CLUSTERNAME@_Restart";
> > >
> > > -Fiel
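P.S. On the question of whether slon ever sees the restart signal: the
notify at the end of the excerpt goes out on the "_<clustername>_Restart"
channel, so a crude way to watch for it is to LISTEN on the same name
from a psql session connected to that node. A sketch, with "_whatever"
again standing in for the real cluster name; note that psql only reports
pending notifications the next time it talks to the server:

    LISTEN "_whatever_Restart";
    -- ...then run any statement (even SELECT 1;) after the failover and
    -- watch for the asynchronous-notification message in psql's output.

If that notification never shows up on node 3, the failover logic never
asked node 3's slon to restart.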