Mon Sep 24 16:09:02 PDT 2007
- Previous message: [Slony1-general] bug in deadlock handling?
- Next message: [Slony1-general] bug in deadlock handling?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 9/21/07, Jeff Frost <jeff at frostconsultingllc.com> wrote:
>
> On Wed, 19 Sep 2007, Jeff Frost wrote:
>
> > On Wed, 19 Sep 2007, Christopher Browne wrote:
> >
> >>> Bails out is the wrong description. Clobbered by the pgsql deadlock
> >>> detection system, leaving the cluster in an unstable state, would be
> >>> more accurate, if that's what happened. I don't know that there's a
> >>> way to catch that clobberage and run a "finally" type thing.
> >>
> >> The thing is, that's not what happens.
> >>
> >> Unless there's a COMMIT somewhere in the DDL script, in which case all
> >> bets are off, everywhere, the deadlock should lead to one of two
> >> things happening:
> >>
> >> 1. The other process that was holding onto the tables might fail and
> >> roll back, and the DDL script would complete, or
> >>
> >> 2. The DDL script will fail and roll back, leaving the state of the
> >> tables as it was before DDL script processing began.
> >>
> >> In either case, the results should leave the node in a consistent
> >> state, either:
> >> a) with the DDL request having gone in, or
> >> b) with the DDL request *not* having gone in.
> >>
> >> Unless there's an extra COMMIT in the code (and I just looked at the
> >> code, and Did Not See One), the failure resulting from a deadlock
> >> should be benign, restoring the tables to the "previously altered
> >> state."
> >>
> >>> So, is there a reasonable way to fix this without
> >>> dropping/resubscribing the node?
> >>>
> >>> Well, to start with, you might want to figure out why your
> >>> application is taking such aggressive locks, and make sure that it
> >>> doesn't happen again in the future (not much point fixing it if it's
> >>> just going to re-occur). If you are using a separate superuser
> >>> account to connect to your database and run your slons (generally
> >>> the "slony" user), then this is really easy to do: tweak your pg_hba
> >>> to only allow connections from slony, then kick all active
> >>> non-slony connections. Revert your pg_hba at the end of the
> >>> maintenance.
> >>>
> >>> If you're willing to experiment with Slony-I internal functions, you
> >>> could put the table in question into altered state. Something like
> >>> this on the offending node might work:
> >>>
> >>> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE
> >>> tab_reloid = (SELECT oid FROM pg_class WHERE relname='cart')));
> >>>
> >>> Or of course it might mess things up even more. :)
> >>
> >> The one change to suggest that comes to *my* mind is that we perhaps
> >> ought to change Slony-I to aggressively lock the replicated tables
> >> ASAP at the start of the DDL script event.
> >>
> >> That will either immediately succeed, eliminating any possibility of
> >> a later deadlock, or immediately fail, before we even try to alter
> >> anything.
> >>
> >> This same change was made to the SUBSCRIBE SET process (well,
> >> "COPY_SET", strictly speaking, but that's not at all user
> >> visible...), as we saw cases where gradually escalating locks on
> >> tables led to deadlocks that could waste hours' worth of
> >> subscription work.
> >>
> >> But it seems likely to me that there's more to the problem than we're
> >> hearing, because a deadlock shouldn't cause any corruption of
> >> anything; to the contrary, it may be expected to prevent it.
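[For concreteness, the up-front locking Christopher describes could look
roughly like the sketch below. This is a minimal illustration, not code
from Slony-I: it assumes the cluster schema "_nerdcluster" from this
thread, and lock_replicated_tables is a hypothetical helper name.]

--------
-- Sketch: take AccessExclusiveLock on every replicated table before
-- any of them is altered. NOWAIT makes the whole attempt fail at once
-- instead of queueing behind readers one table at a time, which is the
-- gradual lock escalation that invites deadlocks.
CREATE OR REPLACE FUNCTION lock_replicated_tables() RETURNS void AS $$
DECLARE
    rec RECORD;
BEGIN
    FOR rec IN
        SELECT n.nspname, c.relname
          FROM "_nerdcluster".sl_table t
          JOIN pg_class c ON c.oid = t.tab_reloid
          JOIN pg_namespace n ON n.oid = c.relnamespace
    LOOP
        EXECUTE 'LOCK TABLE '
             || quote_ident(rec.nspname) || '.' || quote_ident(rec.relname)
             || ' IN ACCESS EXCLUSIVE MODE NOWAIT';
    END LOOP;
END;
$$ LANGUAGE plpgsql;
--------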
> >
> > This is the latest SQL that caused the problem (note there is no
> > COMMIT in the SQL):
> >
> > --------
> > CREATE TABLE orders.amazon_items
> > (
> >   id serial NOT NULL,
> >   order_id integer NOT NULL,
> >   item_id integer NOT NULL,
> >   amazon_item_id character varying(14) NOT NULL,
> >   CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
> >   CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
> >     REFERENCES orders.orders (id) MATCH SIMPLE
> >     ON UPDATE NO ACTION ON DELETE NO ACTION,
> >   CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
> >     REFERENCES orders.items (id) MATCH SIMPLE
> >     ON UPDATE NO ACTION ON DELETE NO ACTION
> > )
> > WITH OIDS;
> > ALTER TABLE orders.amazon_items OWNER TO thenerds;
> > --------
> >
> > It was called by the following slonik script:
> >
> > --------
> > #!/usr/bin/slonik
> > include </nerds/preamble.slonik>;
> >
> > EXECUTE SCRIPT (
> >   SET ID = 1,
> >   FILENAME = '/nerds/thenerds.sql',
> >   EVENT NODE = 1
> > );
> > --------
> >
> > and caused the following deadlock to occur:
> >
> > Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
> > remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1,
> > -1); " PGRES_FATAL_ERROR
> > Sep 18 15:27:54 sql1 slon[12252]: [39-2] ERROR: deadlock detected
> > Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL: Process 12263 waits
> > for AccessExclusiveLock on relation 121589880 of database 121589046;
> > blocked by process 12096.
> > Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
> > AccessShareLock on relation 121589817 of database 121589046; blocked
> > by process 12263.
> >
> > This then left some of the tables on that slave in a bad state,
> > breaking replication:
> >
> > 2007-09-18 15:56:06 EDT ERROR remoteWorkerThread_1: "select
> > "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR
> > ERROR: Slony-I: alterTableRestore(): Table "public"."carts" is not in
> > altered state
> > CONTEXT: SQL statement "SELECT "_nerdcluster".alterTableRestore( $1 )"
> > PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
> >
> > Note that it's just an AccessShareLock that's killing us. It looks
> > like that's caused by a SELECT query that does searches. Our
> > application does not produce any extraneous locking; it simply does
> > SELECTs on that server.
> >
> > Interestingly, before we started using the slave for queries, we would
> > have the deadlocks happen on the master when doing DDL changes, but
> > this never caused the tables on the master to get into a bad state.
> > You could just re-run your EXECUTE SCRIPT and it would usually work
> > fine the second time.
> >
> > What other info can I provide?
>
> So, what I'm getting from all this is that while the deadlocks can
> occur, Slony should gracefully error out and return the tables to their
> previously altered state, but that doesn't seem to happen on these
> nodes. Note that it's only a problem when there's a deadlock on the
> slave, not on the master. Should I file this as a bug? If so, do I just
> need to send an email to slony1-bugs?

If you can reproduce the problem then yes, absolutely. Ideally, write a
new test case that demonstrates it.

Andrew
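[As an aside, the raw relation OIDs in the deadlock DETAIL above can be
mapped back to table names on the affected node; a sketch, using the two
OIDs from Jeff's log:]

--------
-- Resolve the relation OIDs reported in the deadlock message to
-- schema-qualified table names.
SELECT c.oid, n.nspname || '.' || c.relname AS table_name
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.oid IN (121589880, 121589817);
--------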
More information about the Slony1-general mailing list