Wed Oct 5 00:30:06 PDT 2005
- Previous message: [Slony1-general] Re: Slony1-1.0.5 Failover does not work - replication set isn't being moved
- Next message: [Slony1-general] Re: Slony1-1.0.5 Failover does not work - replication set isn't being moved
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Yes, it should. But it doesn't. I believe no message is ever sent to the
3rd node. This is the same in my example. See also the sl_setsync table.
It has a reference to node 1 (or 10).

On Tue, Oct 04, 2005 at 06:17:06PM -0400, Fiel Cabral wrote:
> The sl_event table on Node 2 contains a FAILOVER_SET event but node 3 (the
> backup node specified in the failover command) does not. Should the backup
> node's sl_event table contain the FAILOVER_SET?
>
> sl_event on node 2 contains a FAILOVER_SET:
>         ev_timestamp        | ev_origin | ev_seqno |       ev_type
> ----------------------------+-----------+----------+---------------------
>  2005-10-04 17:49:10.487603 |         2 |        1 | STORE_PATH
>  2005-10-04 17:49:10.70457  |         2 |        2 | STORE_PATH
>  2005-10-04 17:49:10.712416 |         2 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.77891  |         2 |        4 | STORE_LISTEN
>  2005-10-04 17:49:38.146642 |         2 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:05.608095 |         1 |      306 | STORE_NODE
>  2005-10-04 17:49:05.608095 |         1 |      307 | ENABLE_NODE
>  2005-10-04 17:49:08.029042 |         1 |      308 | STORE_NODE
>  2005-10-04 17:49:08.029042 |         1 |      309 | ENABLE_NODE
>  2005-10-04 17:49:10.641208 |         1 |      310 | STORE_PATH
>  2005-10-04 17:49:10.679501 |         1 |      311 | STORE_PATH
>  2005-10-04 17:49:10.722549 |         1 |      312 | STORE_LISTEN
>  2005-10-04 17:49:10.751999 |         1 |      313 | STORE_LISTEN
>  2005-10-04 17:55:02.413185 |         2 |        6 | SYNC
>  2005-10-04 17:49:42.44082  |         1 |      314 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:49:10.60801  |         3 |        1 | STORE_PATH
>  2005-10-04 17:49:42.769833 |         1 |      315 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:49:10.678128 |         3 |        2 | STORE_PATH
>  2005-10-04 17:49:10.713706 |         3 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.743235 |         3 |        4 | STORE_LISTEN
>  2005-10-04 17:49:38.417454 |         3 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:52.680621 |         1 |      316 | SYNC
>  2005-10-04 17:50:53.010532 |         1 |      317 | SYNC
>  2005-10-04 17:51:53.112317 |         1 |      318 | SYNC
>  2005-10-04 17:52:53.146222 |         1 |      319 | SYNC
>  2005-10-04 17:53:53.192119 |         1 |      320 | SYNC
>  2005-10-04 17:54:53.602106 |         1 |      321 | SYNC
>  2005-10-04 17:55:53.710807 |         1 |      322 | SYNC
>  2005-10-04 17:56:02.893106 |         2 |        7 | SYNC
>  2005-10-04 17:56:42.786823 |         3 |        6 | SYNC
>  2005-10-04 17:56:53.833985 |         1 |      323 | SYNC
>  2005-10-04 17:57:03.007883 |         2 |        8 | SYNC
>  2005-10-04 17:57:43.692981 |         3 |        7 | SYNC
>  2005-10-04 17:57:53.902912 |         1 |      324 | SYNC
>  2005-10-04 17:58:03.062867 |         2 |        9 | SYNC
>  2005-10-04 17:58:43.736478 |         3 |        8 | SYNC
>  2005-10-04 17:58:53.953325 |         1 |      325 | SYNC
>  2005-10-04 17:59:03.112996 |         2 |       10 | SYNC
>  2005-10-04 17:59:43.77303  |         3 |        9 | SYNC
>  2005-10-04 17:59:54.095892 |         1 |      326 | SYNC
>  2005-10-04 18:00:03.155204 |         2 |       11 | SYNC
>  2005-10-04 18:00:43.810793 |         3 |       10 | SYNC
>  2005-10-04 18:01:03.196571 |         2 |       12 | SYNC
>  2005-10-04 18:01:43.865925 |         3 |       11 | SYNC
>  2005-10-04 18:02:03.216029 |         2 |       13 | SYNC
>  2005-10-04 18:02:43.905505 |         3 |       12 | SYNC
>  2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
>  2005-10-04 18:03:38.947704 |         1 |      327 | FAILOVER_SET
>  2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
>  2005-10-04 18:03:49.921361 |         2 |       15 | SYNC
>  2005-10-04 18:04:48.875801 |         3 |       14 | SYNC
>  2005-10-04 18:04:49.970829 |         2 |       16 | SYNC
>  2005-10-04 18:05:48.92941  |         3 |       15 | SYNC
>  2005-10-04 18:05:49.985511 |         2 |       17 | SYNC
>  2005-10-04 18:06:48.963277 |         3 |       16 | SYNC
>  2005-10-04 18:06:49.998737 |         2 |       18 | SYNC
>  2005-10-04 18:07:49.033346 |         3 |       17 | SYNC
>  2005-10-04 18:07:50.028334 |         2 |       19 | SYNC
>  2005-10-04 18:08:49.051861 |         3 |       18 | SYNC
>  2005-10-04 18:08:50.056542 |         2 |       20 | SYNC
>  2005-10-04 18:09:49.075309 |         3 |       19 | SYNC
>  2005-10-04 18:09:50.093277 |         2 |       21 | SYNC
> (62 rows)
>
> sl_event on node 3 (backup node) does not have the FAILOVER_SET:
>
>         ev_timestamp        | ev_origin | ev_seqno |       ev_type
> ----------------------------+-----------+----------+---------------------
>  2005-10-04 17:49:10.60801  |         3 |        1 | STORE_PATH
>  2005-10-04 17:49:10.678128 |         3 |        2 | STORE_PATH
>  2005-10-04 17:49:10.713706 |         3 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.743235 |         3 |        4 | STORE_LISTEN
>  2005-10-04 17:49:38.417454 |         3 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:10.487603 |         2 |        1 | STORE_PATH
>  2005-10-04 17:49:08.029042 |         1 |      308 | STORE_NODE
>  2005-10-04 17:49:10.70457  |         2 |        2 | STORE_PATH
>  2005-10-04 17:49:08.029042 |         1 |      309 | ENABLE_NODE
>  2005-10-04 17:49:10.712416 |         2 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.641208 |         1 |      310 | STORE_PATH
>  2005-10-04 17:49:10.77891  |         2 |        4 | STORE_LISTEN
>  2005-10-04 17:49:10.679501 |         1 |      311 | STORE_PATH
>  2005-10-04 17:49:38.146642 |         2 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:10.722549 |         1 |      312 | STORE_LISTEN
>  2005-10-04 17:55:02.413185 |         2 |        6 | SYNC
>  2005-10-04 17:56:02.893106 |         2 |        7 | SYNC
>  2005-10-04 17:49:10.751999 |         1 |      313 | STORE_LISTEN
>  2005-10-04 17:49:42.44082  |         1 |      314 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:56:42.786823 |         3 |        6 | SYNC
>  2005-10-04 17:57:03.007883 |         2 |        8 | SYNC
>  2005-10-04 17:49:42.769833 |         1 |      315 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:49:52.680621 |         1 |      316 | SYNC
>  2005-10-04 17:50:53.010532 |         1 |      317 | SYNC
>  2005-10-04 17:51:53.112317 |         1 |      318 | SYNC
>  2005-10-04 17:52:53.146222 |         1 |      319 | SYNC
>  2005-10-04 17:53:53.192119 |         1 |      320 | SYNC
>  2005-10-04 17:54:53.602106 |         1 |      321 | SYNC
>  2005-10-04 17:55:53.710807 |         1 |      322 | SYNC
>  2005-10-04 17:56:53.833985 |         1 |      323 | SYNC
>  2005-10-04 17:57:43.692981 |         3 |        7 | SYNC
>  2005-10-04 17:57:53.902912 |         1 |      324 | SYNC
>  2005-10-04 17:58:03.062867 |         2 |        9 | SYNC
>  2005-10-04 17:58:43.736478 |         3 |        8 | SYNC
>  2005-10-04 17:58:53.953325 |         1 |      325 | SYNC
>  2005-10-04 17:59:03.112996 |         2 |       10 | SYNC
>  2005-10-04 17:59:43.77303  |         3 |        9 | SYNC
>  2005-10-04 17:59:54.095892 |         1 |      326 | SYNC
>  2005-10-04 18:00:03.155204 |         2 |       11 | SYNC
>  2005-10-04 18:00:43.810793 |         3 |       10 | SYNC
>  2005-10-04 18:01:03.196571 |         2 |       12 | SYNC
>  2005-10-04 18:01:43.865925 |         3 |       11 | SYNC
>  2005-10-04 18:02:03.216029 |         2 |       13 | SYNC
>  2005-10-04 18:02:43.905505 |         3 |       12 | SYNC
>  2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
>  2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
>  2005-10-04 18:03:49.921361 |         2 |       15 | SYNC
>  2005-10-04 18:04:48.875801 |         3 |       14 | SYNC
>  2005-10-04 18:04:49.970829 |         2 |       16 | SYNC
>  2005-10-04 18:05:48.92941  |         3 |       15 | SYNC
>  2005-10-04 18:05:49.985511 |         2 |       17 | SYNC
>  2005-10-04 18:06:48.963277 |         3 |       16 | SYNC
>  2005-10-04 18:06:49.998737 |         2 |       18 | SYNC
>  2005-10-04 18:07:49.033346 |         3 |       17 | SYNC
>  2005-10-04 18:07:50.028334 |         2 |       19 | SYNC
>  2005-10-04 18:08:49.051861 |         3 |       18 | SYNC
>  2005-10-04 18:08:50.056542 |         2 |       20 | SYNC
>  2005-10-04 18:09:49.075309 |         3 |       19 | SYNC
>  2005-10-04 18:09:50.093277 |         2 |       21 | SYNC
>  2005-10-04 18:10:49.100012 |         3 |       20 | SYNC
>  2005-10-04 18:10:50.117138 |         2 |       22 | SYNC
> (61 rows)
>
>
> On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
>
> The problem persists after the node IDs were changed from [1, 2, 3] to
> [10, 20, 30].
>
> Inside gdb, the failedNode2 query did not return an error (function
> return value was 0).
>
> Node 2 was able to move set_origin to node 3.
> Node 3 is stuck with set_origin = node 1.
>
>
> On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
>
> Thanks Elein. I'll run gdb and step through slonik_failed_node to
> (maybe) see if failedNode2 is failing.
>
>
> On 10/4/05, elein <elein at varlena.com> wrote:
>
> Fiel,
>
> In my own tests, with node 10->20->30, failover from 10 to 20 failed
> because node 30 was unusable and had to be recreated from scratch.
> This is a serious bug in my book.
>
> In one case the problem seemed to be dropping the first node
> "too soon". I have not tested that case so I don't know that
> this was the problem.
>
> What I have verified is that the third node never received any message
> regarding the failover and did not change its information
> to get its table set from the new origin, 20.
>
> Also, try not to use Node 1, 2, 3. Node 1 has some special meaning
> in some cases that you will want to avoid.
>
> We are with you, not ignoring you.
>
> --elein
>
> On Tue, Oct 04, 2005 at 11:13:19AM -0400, Fiel Cabral wrote:
> > Right after running the failover command I issue the DROP NODE command
> > to drop node 1. slonik prints an error message and exits with return
> > value 12:
> >
> > sys:17: TRY: drop node
> > sys:19: PGRES_FATAL_ERROR select "_whatever".dropNode(1); - ERROR: Slony-I: Node 1 is still origin of one or more sets
> >
> > Something should have changed the origin to node 3 but it isn't
> > happening.
> >
> >
> > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> >
> > I have 3 nodes. Nodes 2 and 3 are subscribers of node 1 and I'm trying
> > to failover from node 1 to node 3. The failover command succeeds but
> > the database of node 3 is still read-only and the origin is still
> > node 1. I don't have the same problem when doing failover with only
> > two nodes because the set is moved immediately by failedNode.
> >
> > failedNode (in the code below) is able to set the provider
> > successfully.
> >
> > Some code elsewhere is actually moving the replication set. Where is
> > that code? Is it in slon or slonik or in the sql scripts?
> >
> > How do I find out that slon caught the signal and is doing the right
> > thing in response to the signal?
> >
> >     raise notice ''failedNode: set % has other direct receivers - change providers only'', v_row.set_id;
> >     -- ----
> >     -- Backup node is not the only direct subscriber. This
> >     -- means that at this moment, we redirect all direct
> >     -- subscribers to receive from the backup node, and the
> >     -- backup node itself to receive from another one.
> >     -- The admin utility will wait for the slon engine to
> >     -- restart and then call failedNode2() on the node with
> >     -- the highest SYNC and redirect this to it on
> >     -- backup node later.
> >     -- ----
> >     ... etc ...
> >
> >     -- ----
> >     -- Make sure the node daemon will restart
> >     -- ----
> >     notify "_@CLUSTERNAME@_Restart";
> >
> > -Fiel
> >
> > _______________________________________________
> > Slony1-general mailing list
> > Slony1-general at gborg.postgresql.org
> > http://gborg.postgresql.org/mailman/listinfo/slony1-general
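
For anyone who wants to check the two quoted sl_event listings mechanically rather than by eye, here is a minimal sketch in Python. It is only an illustration: the (ev_origin, ev_seqno) ranges are transcribed by hand from the listings quoted above, and the helper name seq_pairs is made up for this example and is not part of Slony-I.

```python
# Sketch only: compare the (ev_origin, ev_seqno) pairs from the two
# sl_event listings quoted in this thread. Ranges are hand-transcribed
# from those listings; seq_pairs is a hypothetical helper.

def seq_pairs(ranges):
    """Expand {origin: (first_seqno, last_seqno)} into a set of pairs."""
    return {
        (origin, seqno)
        for origin, (first, last) in ranges.items()
        for seqno in range(first, last + 1)
    }

# Node 2 listing (62 rows) and node 3 listing (61 rows).
node2_events = seq_pairs({1: (306, 327), 2: (1, 21), 3: (1, 19)})
node3_events = seq_pairs({1: (308, 326), 2: (1, 22), 3: (1, 20)})

# ev_type of the origin-1 events at the edges of the node 2 listing.
ev_type = {
    (1, 306): "STORE_NODE",
    (1, 307): "ENABLE_NODE",
    (1, 327): "FAILOVER_SET",
}

missing_on_node3 = sorted(node2_events - node3_events)
only_on_node3 = sorted(node3_events - node2_events)

for key in missing_on_node3:
    print("missing on node 3:", key, ev_type.get(key, "?"))
for key in only_on_node3:
    print("only on node 3:", key)
```

With these ranges, the events present on node 2 but absent on node 3 are origin 1 seqno 306, 307 and 327, the last being the FAILOVER_SET. The two events seen only on node 3 (origin 2 seqno 22, origin 3 seqno 20) are SYNCs from 18:10, one minute after the last row of the node 2 listing.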