Wed Jul 26 07:40:40 PDT 2006
Andrew Sullivan wrote:
> On Sat, Jul 22, 2006 at 04:17:47PM +0200, Florian G. Pflug wrote:
>> Andrew Sullivan wrote:
>>> Are you sure this will be an improvement? It might just be a
>>> foot-gun of a different calibre.
>> I'm quite sure that it would be an improvement for at least
>> my use case of slony1.
>
> I don't like to be mean, but "this will help me" is not a reason to
> implement, if it makes things worse for others. The question is not
> merely whether it will work for some cases, but whether it improves
> the system overall for users. If the tradeoff is that it makes
> things better for some, but makes certain other failure cases way
> more troublesome, that may be a trade-off we don't want to make.

It'd be an _optional_ feature. Nobody would force _anyone_ to use 2pc
schema updates, but some people, like me, _could_ use them. I don't see
how having the _option_ to do reliable schema updates could hurt anyone.

>> The worst that could happen is that you get some transactions stuck
>> in the prepared state, and need to manually roll them back on some
>> nodes. Currently, it's quite easy to destroy your whole cluster by
>> messing up a schema change.
>
> The "on some nodes" thing is part of what is making me uneasy here.
> What this says to me is that, to fix the issue that currently it is
> easy for someone who hasn't carefully tested a DDL EXECUTE SCRIPT (or
> who hasn't read the documentation) to break things, we're going to
> introduce a failure mode whereby the DBA may need to intervene
> manually on some nodes. That seems to me like a step backwards. If
> the problem is that people are doing things which break stuff, then I
> suspect we need to improve the interface such that it is harder to
> break stuff, rather than introducing a new set of manual-intervention
> steps.

That worst case would only happen if you lost the network connection to
a node while the schema change was still running. And it could be solved
by some process (slon, or some other process) that checks for leftover
2pc transactions and removes them.
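As a rough sketch (assuming PostgreSQL 8.1 or later, where 2PC and the
pg_prepared_xacts view are available), such a cleanup process could look
for prepared transactions matching whatever gid naming convention the
tool uses - the 'slonik_ddl_' prefix below is invented for illustration:

  -- list the prepared transactions still pending on this node
  SELECT gid, prepared, database
    FROM pg_prepared_xacts
   WHERE gid LIKE 'slonik_ddl_%';

  -- remove an orphaned one (or COMMIT PREPARED instead, if it is
  -- known to have prepared successfully on every node)
  ROLLBACK PREPARED 'slonik_ddl_42';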
And even without that safeguard, it's a step *forward*. If you mess up
a schema change now, you easily get into a state where only "drop
subscription, resubscribe" will get your cluster going again. With the
current design of slony, that is about the most painful operation
possible, because during the _whole_ resubscribe the tables on the
slaves are locked, _even_ for concurrent _readers_. Logging into all
nodes and doing "rollback <transaction>" is a minor nuisance compared
to resubscribing all sets.

> Note that I'm not saying "don't do this". I'm saying instead that a
> 2PC and a non-2PC approach in the same version of Slony at least
> seems a bad idea to me -- it's too complicated. Better to drop
> support for non-2PC-capable versions. Moreover, I'm saying that
> you'd better have a pretty clean design and a nice set of
> administration tools to handle the failure modes, or all you do is
> move the pain around to some new place. I can't see the point of
> doing a lot of work to get beaten up by people complaining about some
> other failure mode.

At least for a first implementation, I'm thinking about doing it the
other way round. The algorithm I have in mind would be implemented
purely inside slonik, and would do the following (a rough SQL sketch
of the whole sequence follows at the end of this message):

0) Issue "begin;" on the origin.
1) Lock the tables on the origin. Concurrent readers are OK, but it
   must block writers (inserts, updates, and deletes).
2) Wait until all subscribers have caught up.
3) Issue "begin;" on all subscribers.
4) Do the schema change on all nodes.
5) Issue "prepare" on all nodes.
6) If all nodes have prepared, issue "commit" on all nodes.
   Otherwise, issue "rollback" and report the error.

I see two problems with that approach, but I don't see them as
showstoppers:

1) If slonik crashes after starting (5), but before finishing (6), the
   transaction is left prepared on some nodes. In that case, it'd be
   the job of the admin to
   .) find out whether it was prepared successfully on _all_ nodes,
   .) if yes, "commit" it everywhere,
   .) if not, "rollback" it everywhere
   (using pg_prepared_xacts, as sketched above). Automating this
   recovery is possible in theory - but it requires slonik to remember
   whether step (5) was successful on all nodes or not.
2) It blocks inserts/updates on the origin. Fixing that would require
   the algorithm to be integrated more deeply into slony itself.

greetings, Florian Pflug
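To make the sequence above concrete, here is a rough SQL-level sketch
of what slonik might send over its connections. It assumes PostgreSQL
8.1+ for 2PC; the table name and the gid 'slonik_ddl_42' are invented
for illustration:

  -- steps 0/1, on the origin: open a transaction and take locks that
  -- let concurrent readers through but block all writers
  BEGIN;
  LOCK TABLE my_replicated_table IN SHARE MODE;

  -- step 2: wait until every subscriber has confirmed the last SYNC
  -- step 3, on each subscriber: BEGIN;

  -- step 4, on every node, inside the still-open transaction:
  ALTER TABLE my_replicated_table ADD COLUMN note text;

  -- step 5, on every node: prepare instead of committing
  PREPARE TRANSACTION 'slonik_ddl_42';

  -- step 6: if every node prepared successfully, on every node:
  COMMIT PREPARED 'slonik_ddl_42';
  -- otherwise, on every node that did manage to prepare:
  -- ROLLBACK PREPARED 'slonik_ddl_42';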