Fiel Cabral e4696wyoa63emq6w3250kiw60i45e1
Mon Oct 10 18:28:31 PDT 2005
Yes you're right.

The backup node's (node 3) sl_setsync has a reference to node 1:
 ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip   | ssy_action_list
-----------+------------+-----------+------------+------------+-----------+-----------------
         1 |          1 |       326 |    2774085 |    2774088 | '2774085' |
(1 row)

Node 2's sl_setsync has this:
 ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
-----------+------------+-----------+------------+------------+---------+-----------------
         1 |          3 |       124 |    2595423 |    2595424 |         |
(1 row)
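
For anyone comparing the same tables on their own cluster, the check above can be scripted. This is only a rough sketch under assumptions: it expects psql's default pipe-separated output pasted in as plain text (quote markers removed), it relies on the column order shown in this thread, and the helper names (parse_rows, nodes_still_on_origin, missing_events) are made up for illustration and are not part of Slony-I.

```python
def parse_rows(text):
    """Split psql-style 'a | b | c' data lines into lists of stripped fields."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Data rows in these listings start with a digit (a set id or a
        # timestamp); headers, dashed separators, blank lines and the
        # "(N rows)" footer do not, so they are skipped.
        if not line[:1].isdigit():
            continue
        rows.append([field.strip() for field in line.split("|")])
    return rows


def nodes_still_on_origin(setsync_by_node, failed_origin):
    """Return node ids whose sl_setsync output still names failed_origin.

    Assumed column order: ssy_setid, ssy_origin, ssy_seqno, ...
    """
    stale = []
    for node_id, text in sorted(setsync_by_node.items()):
        if any(row[1] == str(failed_origin) for row in parse_rows(text)):
            stale.append(node_id)
    return stale


def missing_events(reference_text, other_text):
    """Return (ev_origin, ev_seqno, ev_type) rows in reference but not in other.

    Assumed column order: ev_timestamp, ev_origin, ev_seqno, ev_type.
    """
    def keys(text):
        return {(int(r[1]), int(r[2]), r[3]) for r in parse_rows(text)}
    return sorted(keys(reference_text) - keys(other_text))


# The two sl_setsync rows from this message, keyed by the node they came from.
node2 = "1 | 3 | 124 | 2595423 | 2595424 | |"
node3 = "1 | 1 | 326 | 2774085 | 2774088 | '2774085' |"
print(nodes_still_on_origin({2: node2, 3: node3}, failed_origin=1))  # prints [3]
```

Running missing_events on the node 2 and node 3 sl_event listings quoted below (with the leading quote markers stripped) should include (1, 327, 'FAILOVER_SET') among the events node 3 never saw.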


On 10/4/05, elein <elein at varlena.com> wrote:
>
> Yes, it should. But it doesn't. I don't believe any message is ever
> sent to the 3rd node. This is the same in my example. See also the
> sl_setsync table. It has a reference to node 1 (or 10).
>
> On Tue, Oct 04, 2005 at 06:17:06PM -0400, Fiel Cabral wrote:
> > The sl_event table on Node 2 contains a FAILOVER_SET event but node 3
> > (the backup node specified in the failover command) does not. Should the
> > backup node's sl_event table contain the FAILOVER_SET?
> >
> > sl_event on node 2 contains a FAILOVER_SET:
> > ev_timestamp | ev_origin | ev_seqno | ev_type
> > ----------------------------+-----------+----------+---------------------
> > 2005-10-04 17:49:10.487603 | 2 | 1 | STORE_PATH
> > 2005-10-04 17:49:10.70457 | 2 | 2 | STORE_PATH
> > 2005-10-04 17:49:10.712416 | 2 | 3 | STORE_LISTEN
> > 2005-10-04 17:49:10.77891 | 2 | 4 | STORE_LISTEN
> > 2005-10-04 17:49:38.146642 | 2 | 5 | SUBSCRIBE_SET
> > 2005-10-04 17:49:05.608095 | 1 | 306 | STORE_NODE
> > 2005-10-04 17:49:05.608095 | 1 | 307 | ENABLE_NODE
> > 2005-10-04 17:49:08.029042 | 1 | 308 | STORE_NODE
> > 2005-10-04 17:49:08.029042 | 1 | 309 | ENABLE_NODE
> > 2005-10-04 17:49:10.641208 | 1 | 310 | STORE_PATH
> > 2005-10-04 17:49:10.679501 | 1 | 311 | STORE_PATH
> > 2005-10-04 17:49:10.722549 | 1 | 312 | STORE_LISTEN
> > 2005-10-04 17:49:10.751999 | 1 | 313 | STORE_LISTEN
> > 2005-10-04 17:55:02.413185 | 2 | 6 | SYNC
> > 2005-10-04 17:49:42.44082 | 1 | 314 | ENABLE_SUBSCRIPTION
> > 2005-10-04 17:49:10.60801 | 3 | 1 | STORE_PATH
> > 2005-10-04 17:49:42.769833 | 1 | 315 | ENABLE_SUBSCRIPTION
> > 2005-10-04 17:49:10.678128 | 3 | 2 | STORE_PATH
> > 2005-10-04 17:49:10.713706 | 3 | 3 | STORE_LISTEN
> > 2005-10-04 17:49:10.743235 | 3 | 4 | STORE_LISTEN
> > 2005-10-04 17:49:38.417454 | 3 | 5 | SUBSCRIBE_SET
> > 2005-10-04 17:49:52.680621 | 1 | 316 | SYNC
> > 2005-10-04 17:50:53.010532 | 1 | 317 | SYNC
> > 2005-10-04 17:51:53.112317 | 1 | 318 | SYNC
> > 2005-10-04 17:52:53.146222 | 1 | 319 | SYNC
> > 2005-10-04 17:53:53.192119 | 1 | 320 | SYNC
> > 2005-10-04 17:54:53.602106 | 1 | 321 | SYNC
> > 2005-10-04 17:55:53.710807 | 1 | 322 | SYNC
> > 2005-10-04 17:56:02.893106 | 2 | 7 | SYNC
> > 2005-10-04 17:56:42.786823 | 3 | 6 | SYNC
> > 2005-10-04 17:56:53.833985 | 1 | 323 | SYNC
> > 2005-10-04 17:57:03.007883 | 2 | 8 | SYNC
> > 2005-10-04 17:57:43.692981 | 3 | 7 | SYNC
> > 2005-10-04 17:57:53.902912 | 1 | 324 | SYNC
> > 2005-10-04 17:58:03.062867 | 2 | 9 | SYNC
> > 2005-10-04 17:58:43.736478 | 3 | 8 | SYNC
> > 2005-10-04 17:58:53.953325 | 1 | 325 | SYNC
> > 2005-10-04 17:59:03.112996 | 2 | 10 | SYNC
> > 2005-10-04 17:59:43.77303 | 3 | 9 | SYNC
> > 2005-10-04 17:59:54.095892 | 1 | 326 | SYNC
> > 2005-10-04 18:00:03.155204 | 2 | 11 | SYNC
> > 2005-10-04 18:00:43.810793 | 3 | 10 | SYNC
> > 2005-10-04 18:01:03.196571 | 2 | 12 | SYNC
> > 2005-10-04 18:01:43.865925 | 3 | 11 | SYNC
> > 2005-10-04 18:02:03.216029 | 2 | 13 | SYNC
> > 2005-10-04 18:02:43.905505 | 3 | 12 | SYNC
> > 2005-10-04 18:03:03.238632 | 2 | 14 | SYNC
> > 2005-10-04 18:03:38.947704 | 1 | 327 | FAILOVER_SET
> > 2005-10-04 18:03:48.819508 | 3 | 13 | SYNC
> > 2005-10-04 18:03:49.921361 | 2 | 15 | SYNC
> > 2005-10-04 18:04:48.875801 | 3 | 14 | SYNC
> > 2005-10-04 18:04:49.970829 | 2 | 16 | SYNC
> > 2005-10-04 18:05:48.92941 | 3 | 15 | SYNC
> > 2005-10-04 18:05:49.985511 | 2 | 17 | SYNC
> > 2005-10-04 18:06:48.963277 | 3 | 16 | SYNC
> > 2005-10-04 18:06:49.998737 | 2 | 18 | SYNC
> > 2005-10-04 18:07:49.033346 | 3 | 17 | SYNC
> > 2005-10-04 18:07:50.028334 | 2 | 19 | SYNC
> > 2005-10-04 18:08:49.051861 | 3 | 18 | SYNC
> > 2005-10-04 18:08:50.056542 | 2 | 20 | SYNC
> > 2005-10-04 18:09:49.075309 | 3 | 19 | SYNC
> > 2005-10-04 18:09:50.093277 | 2 | 21 | SYNC
> > (62 rows)
> >
> > sl_event on node 3 (backup node) does not have the FAILOVER_SET:
> >
> > ev_timestamp | ev_origin | ev_seqno | ev_type
> > ----------------------------+-----------+----------+---------------------
> > 2005-10-04 17:49:10.60801 | 3 | 1 | STORE_PATH
> > 2005-10-04 17:49:10.678128 | 3 | 2 | STORE_PATH
> > 2005-10-04 17:49:10.713706 | 3 | 3 | STORE_LISTEN
> > 2005-10-04 17:49:10.743235 | 3 | 4 | STORE_LISTEN
> > 2005-10-04 17:49:38.417454 | 3 | 5 | SUBSCRIBE_SET
> > 2005-10-04 17:49:10.487603 | 2 | 1 | STORE_PATH
> > 2005-10-04 17:49:08.029042 | 1 | 308 | STORE_NODE
> > 2005-10-04 17:49:10.70457 | 2 | 2 | STORE_PATH
> > 2005-10-04 17:49:08.029042 | 1 | 309 | ENABLE_NODE
> > 2005-10-04 17:49:10.712416 | 2 | 3 | STORE_LISTEN
> > 2005-10-04 17:49:10.641208 | 1 | 310 | STORE_PATH
> > 2005-10-04 17:49:10.77891 | 2 | 4 | STORE_LISTEN
> > 2005-10-04 17:49:10.679501 | 1 | 311 | STORE_PATH
> > 2005-10-04 17:49:38.146642 | 2 | 5 | SUBSCRIBE_SET
> > 2005-10-04 17:49:10.722549 | 1 | 312 | STORE_LISTEN
> > 2005-10-04 17:55:02.413185 | 2 | 6 | SYNC
> > 2005-10-04 17:56:02.893106 | 2 | 7 | SYNC
> > 2005-10-04 17:49:10.751999 | 1 | 313 | STORE_LISTEN
> > 2005-10-04 17:49:42.44082 | 1 | 314 | ENABLE_SUBSCRIPTION
> > 2005-10-04 17:56:42.786823 | 3 | 6 | SYNC
> > 2005-10-04 17:57:03.007883 | 2 | 8 | SYNC
> > 2005-10-04 17:49:42.769833 | 1 | 315 | ENABLE_SUBSCRIPTION
> > 2005-10-04 17:49:52.680621 | 1 | 316 | SYNC
> > 2005-10-04 17:50:53.010532 | 1 | 317 | SYNC
> > 2005-10-04 17:51:53.112317 | 1 | 318 | SYNC
> > 2005-10-04 17:52:53.146222 | 1 | 319 | SYNC
> > 2005-10-04 17:53:53.192119 | 1 | 320 | SYNC
> > 2005-10-04 17:54:53.602106 | 1 | 321 | SYNC
> > 2005-10-04 17:55:53.710807 | 1 | 322 | SYNC
> > 2005-10-04 17:56:53.833985 | 1 | 323 | SYNC
> > 2005-10-04 17:57:43.692981 | 3 | 7 | SYNC
> > 2005-10-04 17:57:53.902912 | 1 | 324 | SYNC
> > 2005-10-04 17:58:03.062867 | 2 | 9 | SYNC
> > 2005-10-04 17:58:43.736478 | 3 | 8 | SYNC
> > 2005-10-04 17:58:53.953325 | 1 | 325 | SYNC
> > 2005-10-04 17:59:03.112996 | 2 | 10 | SYNC
> > 2005-10-04 17:59:43.77303 | 3 | 9 | SYNC
> > 2005-10-04 17:59:54.095892 | 1 | 326 | SYNC
> > 2005-10-04 18:00:03.155204 | 2 | 11 | SYNC
> > 2005-10-04 18:00:43.810793 | 3 | 10 | SYNC
> > 2005-10-04 18:01:03.196571 | 2 | 12 | SYNC
> > 2005-10-04 18:01:43.865925 | 3 | 11 | SYNC
> > 2005-10-04 18:02:03.216029 | 2 | 13 | SYNC
> > 2005-10-04 18:02:43.905505 | 3 | 12 | SYNC
> > 2005-10-04 18:03:03.238632 | 2 | 14 | SYNC
> > 2005-10-04 18:03:48.819508 | 3 | 13 | SYNC
> > 2005-10-04 18:03:49.921361 | 2 | 15 | SYNC
> > 2005-10-04 18:04:48.875801 | 3 | 14 | SYNC
> > 2005-10-04 18:04:49.970829 | 2 | 16 | SYNC
> > 2005-10-04 18:05:48.92941 | 3 | 15 | SYNC
> > 2005-10-04 18:05:49.985511 | 2 | 17 | SYNC
> > 2005-10-04 18:06:48.963277 | 3 | 16 | SYNC
> > 2005-10-04 18:06:49.998737 | 2 | 18 | SYNC
> > 2005-10-04 18:07:49.033346 | 3 | 17 | SYNC
> > 2005-10-04 18:07:50.028334 | 2 | 19 | SYNC
> > 2005-10-04 18:08:49.051861 | 3 | 18 | SYNC
> > 2005-10-04 18:08:50.056542 | 2 | 20 | SYNC
> > 2005-10-04 18:09:49.075309 | 3 | 19 | SYNC
> > 2005-10-04 18:09:50.093277 | 2 | 21 | SYNC
> > 2005-10-04 18:10:49.100012 | 3 | 20 | SYNC
> > 2005-10-04 18:10:50.117138 | 2 | 22 | SYNC
> > (61 rows)
> >
> >
> > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> >
> > The problem persists after the node IDs were changed from [1, 2, 3] to
> > [10, 20, 30].
> >
> > Inside gdb, the failedNode2 query did not return an error (function
> > return value was 0).
> >
> > Node 2 was able to change its set_origin to node 3.
> > Node 3 is stuck with set_origin = node 1.
> >
> >
> > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> >
> > Thanks Elein. I'll run gdb and step through slonik_failed_node to
> > (maybe) see if failedNode2 is failing.
> >
> >
> >
> > On 10/4/05, elein <elein at varlena.com > wrote:
> >
> > Fiel,
> >
> > In my own tests, with node 10->20->30, failover from 10 to 20 failed
> > because node 30 was unusable and had to be recreated from scratch.
> > This is a serious bug in my book.
> >
> > In one case the problem seemed to be dropping the first node
> > "too soon". I have not tested that case so I don't know that
> > this was the problem.
> >
> > What I have verified is that the third node never received any message
> > regarding the failover and did not change its information
> > to get its table set from the new origin, 20.
> >
> > Also, try not to use Node 1, 2, 3. Node 1 has some special meaning
> > in some cases that you will want to avoid.
> >
> > We are with you, not ignoring you.
> >
> > --elein
> >
> > On Tue, Oct 04, 2005 at 11:13:19AM -0400, Fiel Cabral wrote:
> > > Right after running the failover command I issue the DROP NODE
> > > command to drop node 1. slonik prints an error message and exits with
> > > return value 12:
> > >
> > > sys:17: TRY: drop node
> > > sys:19: PGRES_FATAL_ERROR select "_whatever".dropNode(1); - ERROR: Slony-I:
> > > Node 1 is still origin of one or more sets
> > >
> > > Something should have changed the origin to node 3 but it isn't
> > > happening.
> > >
> > >
> > > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> > >
> > > I have 3 nodes. Nodes 2 and 3 are subscribers of node 1 and I'm trying to
> > > failover from node 1 to node 3. The failover command succeeds but the
> > > database of node 3 is still read-only and the origin is still node 1. I
> > > don't have the same problem when doing failover with only two nodes because
> > > the set is moved immediately by failedNode.
> > >
> > > failedNode (in the code below) is able to set the provider successfully.
> > >
> > > Some code elsewhere is actually moving the replication set. Where is that
> > > code? Is it in slon or slonik or in the sql scripts?
> > >
> > > How do I find out that slon caught the signal and is doing the right thing
> > > in response to the signal?
> > >
> > > raise notice ''failedNode: set % has other direct receivers - change providers only'', v_row.set_id;
> > > -- ----
> > > -- Backup node is not the only direct subscriber. This
> > > -- means that at this moment, we redirect all direct
> > > -- subscribers to receive from the backup node, and the
> > > -- backup node itself to receive from another one.
> > > -- The admin utility will wait for the slon engine to
> > > -- restart and then call failedNode2() on the node with
> > > -- the highest SYNC and redirect this to it on
> > > -- backup node later.
> > > -- ----
> > > ... etc ...
> > >
> > > -- ----
> > > -- Make sure the node daemon will restart
> > > -- ----
> > > notify "_@CLUSTERNAME@_Restart";
> > >
> > >
> > > -Fiel
> > >
> > >
> > >
> > >
> > >
> > >
> >
> > > _______________________________________________
> > > Slony1-general mailing list
> > > Slony1-general at gborg.postgresql.org
> > > http://gborg.postgresql.org/mailman/listinfo/slony1-general