[Slony1-general] bug in deadlock handling?
Wed Sep 19 12:22:52 PDT 2007
Andrew Hammond wrote:
> On 9/18/07, Jeff Frost <jeff at frostconsultingllc.com> wrote:
>
>> Hi guys, I've got an interesting situation on a slony 1.2.10 3-node cluster.
>> Both slaves get their data direct from the master. Everything has been
>> running well up to a few days ago. Now every time we try to add a new table
>> to the cluster, we end up with the following error:
>
> "Every time"? You have tried more than once?
>
> What I don't see in your problem report is a detailed description,
> starting with the cluster in a known good state, of exactly what you
> have done. Without knowing exactly what you have done (typescripts or
> copies of the shell scripts you used), it's hard to figure out what
> went wrong.
>
>> 2007-09-18 15:56:06 EDT ERROR remoteWorkerThread_1: "select
>> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:
>> Slony-I: alterTableRestore(): Table "public"."carts" is not in altered state
>> CONTEXT: SQL statement "SELECT "_nerdcluster".alterTableRestore( $1 )"
>> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
>>
>> It looks like the problem is being caused by a deadlock:
>>
>> Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
>> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); "
>> PGRES_FATAL_ERROR
>> Sep 18 15:27:54 sql1 slon[12252]: [39-2] ERROR: deadlock detected
>> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL: Process 12263 waits for
>> AccessExclusiveLock on relation 121589880 of database 121589046;
>> blocked by process 12096.
>> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
>> AccessShareLock on relation 121589817 of database 121589046;
>> blocked by process 12263.
>>
>> So, my theory is that the execute script alters the tables back to their
>> normal states, doesn't get all the locks it wants, and bails out without
>> putting them back to their previously altered state, thus breaking
>> replication.
>
> "Bails out" is the wrong description. "Clobbered by the pgsql deadlock
> detection system, leaving the cluster in an unstable state" would be
> more accurate, if that's what happened. I don't know that there's a
> way to catch that clobberage and run a "finally" type thing.

The thing is, that's not what happens. Unless there's a COMMIT somewhere
in the DDL script, in which case all bets are off, everywhere, the
deadlock should lead to one of two things happening:

1. The other process that was holding onto the tables might fail and
   roll back, and the DDL script would complete, or

2. The DDL script will fail and roll back, leading to the state of the
   tables falling back to what it was before DDL script processing began.

In either case, the results should leave the node in a consistent state,
either:

a) With the DDL request having gone in, or

b) With the DDL request *not* having gone in.

Unless there's an extra COMMIT in the code (and I just looked at the
code, and Did Not See One), the failure resulting from a deadlock should
be benign, restoring the tables to the "previously altered state."

>> So, is there a reasonable way to fix this without dropping/resubscribing
>> the node?
>
> Well, to start with, you might want to figure out why your application
> is taking such aggressive locks. And make sure in the future that it
> doesn't happen again (not much point fixing it if it's just gonna
> re-occur).
> If you are using a separate superuser account to connect to your
> database and run your slons (generally the "slony" user), then this is
> really easy to do: tweak your pg_hba to only allow connections from
> slony and then kick all active non-slony connections. Revert your
> pg_hba at the end of the maintenance.
>
> If you're willing to experiment with using slony internal functions,
> you could put the table in question into altered state. Something like
> this on the offending node might work:
>
> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE
> tab_reloid = (SELECT oid FROM pg_class WHERE relname='carts')));
>
> Or of course it might mess things up even more. :)

The one change to suggest that comes to *my* mind is that we perhaps
ought to change Slony-I to aggressively lock the replicated tables ASAP
at the start of the DDL script event. That will either immediately
succeed, and eliminate any possibility of future deadlock, or
immediately fail, before we even try to alter anything.

This same change was made to the SUBSCRIBE SET process (well, "COPY_SET",
strictly speaking, but that's not at all user visible...), as we saw
cases where gradually escalating locks on tables led to deadlocks that
could waste hours' worth of subscription work.

But it seems likely to me that there's more to the problem than we're
hearing, because a deadlock shouldn't cause any corruption of anything -
to the contrary, it may be expected to prevent it.
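
[Editor's note: for illustration only, here is a rough sketch of what
"lock the replicated tables ASAP" could look like at the SQL level. It
is not the actual Slony-I implementation; the cluster name
"_nerdcluster", the ddlScript_*_int calls, and the "carts" table are
taken from the log excerpts above, while the explicit up-front LOCK
TABLE statements are the hypothetical addition being discussed.]

    BEGIN;

    -- Hypothetical up-front locking: take AccessExclusiveLock on every
    -- replicated table before altering anything.  NOWAIT makes this fail
    -- immediately instead of queueing behind other locks and deadlocking
    -- partway through the event.
    LOCK TABLE public.carts IN ACCESS EXCLUSIVE MODE NOWAIT;
    -- ... repeat for each remaining table registered in sl_table ...

    -- Only once all locks are held, restore the tables and run the DDL,
    -- as the remote worker does for an EXECUTE SCRIPT event:
    SELECT "_nerdcluster".ddlScript_prepare_int(1, -1);
    -- <statements from the EXECUTE SCRIPT file>
    SELECT "_nerdcluster".ddlScript_complete_int(1, -1);

    COMMIT;

[If any step errors out, including via the deadlock detector, PostgreSQL
aborts the whole transaction, so the alterations roll back and the
tables remain in their previously altered state - the benign outcome
described above.]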