Wed Jul 5 06:04:39 PDT 2017
- Previous message: [Slony1-general] failover failure and mysterious missing paths
- Next message: [Slony1-general] failover failure and mysterious missing paths
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Interesting. Of course the behavior evident on inspection indicated something like this must be happening. It seems the doc could be improved on the subject of required paths. I recall some sections indicate it is not harmful to have a path from each node to each other node. What seems not to be spelled out is that for the service to be highly available, to have the ability to fail over, each node is *required* to have a path to each other node.

On a related point, it would be a lot more convenient if we could give each node a default path instead of re-specifying the same IP for each new subscriber, and a new line of conninfo for every slonik script.

Would either of these items be worth writing up in bug tracking and/or providing the solution? If so, could I get that link?

Tom ☺

On 7/2/17, 9:30 PM, "Steve Singer" <steve at ssinger.info> wrote:

On Wed, 28 Jun 2017, Tignor, Tom wrote:

> Hi Steve,
> Thanks for the info. I was able to repro this problem in testing and saw that as soon as I added the missing path back, the still-in-process failover op continued on and completed successfully.
> We do issue DROP NODEs in the event we need to restore a replica from scratch, which did occur. However, the restore workflow also should issue store paths to/from the new replica node and every other node. Still investigating this.
> What still confuses me is the recurring “remoteWorkerThread_X: SYNC” output, despite the fact of not having a configured path. If the path is missing, how does slon continue to get SYNC events?

Slon can get events, including SYNC, from nodes other than the event origin if it has a path to that node. However, a slon can only replicate the data from a node it has a path to.

Steve

> Tom ☺
>
> On 6/27/17, 5:04 PM, "Steve Singer" <steve at ssinger.info> wrote:
>
> On 06/27/2017 11:59 AM, Tignor, Tom wrote:
>
> The disableNode() in the log makes it look like someone did a DROP NODE.
>
> If the only issue is that you're missing active paths in sl_path, you can add/update the paths with slonik.
>
> > Hello Slony-I community,
> >
> > Hoping someone can advise on a strange and serious problem. We performed a slony service failover yesterday. For the first time ever, our slony service FAILOVER op errored out. We recently expanded our cluster to 7 consumers from a single provider. There are no load issues during normal operations. As the error output below shows, though, our node 4 and node 5 consumers never got the events they needed. Here’s where it gets weird: closer inspection has shown that node 2->4 and node 2->5 path data went missing out of the service at some point. It seems clear that’s the main issue, but in spite of that, both node 4 and node 5 continued to find and process node 2 SYNC events for a full week! The logs show this happened in spite of multiple restarts.
> >
> > How can this happen? If missing path data stymies the failover, wouldn’t it also prevent normal SYNC processing?
> >
> > In the case where a failover is begun with inadequate path data, what’s the best resolution? Can path data be quickly applied to allow failover to succeed?
> >
> > Thanks in advance for any insights.
> >
> > ---- failover error ----
> >
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE: calling restart node 1
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55: 2017-06-26 18:33:02
> > executing preFailover(1,1) on 2
> > executing preFailover(1,1) on 3
> > executing preFailover(1,1) on 4
> > executing preFailover(1,1) on 5
> > executing preFailover(1,1) on 6
> > executing preFailover(1,1) on 7
> > executing preFailover(1,1) on 8
> > NOTICE: executing "_ams_cluster".failedNode2 on node 2
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 8 only on event 5000061654, node 4 only on event 5000061654, node 5 only on event 5000061655, node 3 only on event 5000061662, node 6 only on event 5000061654, node 7 only on event 5000061656
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061657, node 5 only on event 5000061663, node 3 only on event 5000061663, node 6 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663, node 6 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664). node 4 only on event 5000061663, node 5 only on event 5000061663
> >
> > ---- node 4 log archive ----
> >
> > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
> > 2017-06-15 15:14:00 UTC [5688] INFO localListenThread: got restart notification
> > 2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
> > 2017-06-15 15:53:00 UTC [8431] INFO localListenThread: got restart notification
> > 2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
> > 2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
> > 2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
> > 2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
> > 2017-06-19 15:11:45 UTC [2707] INFO localListenThread: got restart notification
> > 2017-06-20 18:40:15 UTC [31224] INFO localListenThread: got restart notification
> > 2017-06-21 14:31:42 UTC [6253] INFO localListenThread: got restart notification
> > 2017-06-21 14:35:26 UTC [32367] INFO localListenThread: got restart notification
> > 2017-06-26 18:21:25 UTC [9278] INFO localListenThread: got restart notification
> > 2017-06-26 18:33:04 UTC [28839] INFO localListenThread: got restart notification
> > 2017-06-26 18:33:30 UTC [1785] INFO localListenThread: got restart notification
> > bos-mpt5c:odin-9353 ttignor$
> >
> > ---- node 5 log archive ----
> >
> > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=5|restart notification' prod5/node5-pathconfig.out
> > 2017-06-15 15:13:56 UTC [20700] INFO localListenThread: got restart notification
> > 2017-06-15 15:14:06 UTC [20374] CONFIG storePath: pa_server=2 pa_client=5 pa_conninfo="dbname=ams
> > 2017-06-15 15:53:01 UTC [20374] INFO localListenThread: got restart notification
> > 2017-06-15 15:53:11 UTC [2859] CONFIG storePath: pa_server=2 pa_client=5 pa_conninfo="dbname=ams
> > 2017-06-16 17:28:19 UTC [2859] INFO localListenThread: got restart notification
> > 2017-06-16 17:28:29 UTC [10753] CONFIG storePath: pa_server=2 pa_client=5 pa_conninfo="dbname=ams
> > 2017-06-19 15:11:40 UTC [10753] CONFIG disableNode: no_id=2
> > 2017-06-19 15:11:40 UTC [10753] INFO localListenThread: got restart notification
> > 2017-06-20 18:40:11 UTC [450] INFO localListenThread: got restart notification
> > 2017-06-21 14:31:41 UTC [22300] INFO localListenThread: got restart notification
> > 2017-06-21 14:35:28 UTC [26777] INFO localListenThread: got restart notification
> > 2017-06-26 18:21:27 UTC [28366] INFO localListenThread: got restart notification
> > 2017-06-26 18:33:04 UTC [29345] INFO localListenThread: got restart notification
> > 2017-06-26 18:33:27 UTC [1299] INFO localListenThread: got restart notification
> > bos-mpt5c:odin-9353 ttignor$
> >
> > Tom ☺
> >
> > _______________________________________________
> > Slony1-general mailing list
> > Slony1-general at lists.slony.info
> > http://lists.slony.info/mailman/listinfo/slony1-general
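
The thread's conclusion is that failover needs a path between every pair of nodes, and that missing sl_path rows can be restored with slonik. As a rough sketch of how one might check for gaps before a failover, the Python below compares a list of (pa_server, pa_client) pairs against a full mesh and prints the slonik STORE PATH statements that would fill them. The helper names, the three-node example, and the host names in the conninfo strings are all hypothetical; only the node IDs, `dbname=ams`, and the two missing paths (2->4 and 2->5) come from the thread. Fetching the existing pairs from sl_path is left out.

```python
from itertools import permutations


def missing_paths(node_ids, existing_paths):
    """Return the (server, client) pairs absent from a full mesh of node_ids.

    existing_paths is an iterable of (pa_server, pa_client) tuples, e.g. as
    read from the cluster's sl_path table (querying it is not shown here).
    """
    have = set(existing_paths)
    return [
        (server, client)
        for server, client in permutations(sorted(node_ids), 2)
        if (server, client) not in have
    ]


def store_path_statements(missing, conninfo_by_node):
    """Render one slonik STORE PATH statement per missing (server, client) pair.

    conninfo_by_node maps a node ID to the conninfo clients use to reach it.
    """
    return [
        f"store path (server = {server}, client = {client}, "
        f"conninfo = '{conninfo_by_node[server]}');"
        for server, client in missing
    ]


if __name__ == "__main__":
    # Hypothetical subset of the cluster: nodes 2, 4 and 5, with the
    # 2->4 and 2->5 paths missing as described in the thread.
    nodes = [2, 4, 5]
    existing = [(4, 2), (5, 2), (4, 5), (5, 4)]
    # Host names are placeholders, not taken from the thread.
    conninfo = {
        2: "dbname=ams host=node2.example",
        4: "dbname=ams host=node4.example",
        5: "dbname=ams host=node5.example",
    }
    for statement in store_path_statements(missing_paths(nodes, existing), conninfo):
        print(statement)
```

The generated statements would still need the usual slonik preamble (cluster name and admin conninfo lines) before they could be run.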