Wed Dec 19 12:51:16 PST 2007
Running slony 1.2.10, we encountered an issue with a schema upgrade
when it propagated to the secondary in a two-node slony mirror. Our
typical procedure for doing schema upgrades is as follows:
1. Point the application server clients to the secondary (the app
becomes "read-only" as a result).
2. After a few minutes, turn off the slon daemons so we can control
when the schema changes propagate.
3. Apply the schema changes (to the primary) via slonik using execute
script, also adding and removing tables from the replication set as
needed (a sketch of such a slonik script follows this list).
4. After the upgrade completes on the primary, point the app servers
back to it, so the secondary is quiet.
5. Turn on the slon daemons and wait for the changes to propagate to
the secondary.
6. Go to sleep or play video games or what-have-you 8^)
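To make step 3 concrete, here is a minimal sketch of such a slonik
script. The cluster name "radio" and database name are taken from the
log excerpts below; the conninfo strings, table IDs, and file path are
hypothetical placeholders, and details like building a temporary set
and merging it when adding tables to an already-subscribed set are
glossed over:

    cluster name = radio;
    node 1 admin conninfo = 'dbname=radio_prod_seg11 host=primary user=slony';
    node 2 admin conninfo = 'dbname=radio_prod_seg11 host=secondary user=slony';

    # Add a new table to the replication set (placeholder table and id)
    set add table (set id = 1, origin = 1, id = 99,
        fully qualified name = 'public.some_new_table',
        comment = 'added during upgrade');

    # Drop a table that the upgrade removes (placeholder id)
    set drop table (origin = 1, id = 42);

    # Apply the schema-change SQL on the origin; slony replays it on
    # the subscriber as part of the same event
    execute script (set id = 1,
        filename = '/path/to/upgrade.sql',
        event node = 1);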
This has worked well for us for dozens (if not hundreds) of upgrades,
with very occasional hiccups in step 5 due to timing-related issues
(when adding tables, etc.). Last night, though, we had a problem with
step 4: a wedged pgpool process prevented the app servers from
switching back to the primary. When we tried to switch them, they
timed out trying to connect and spontaneously failed over to the
secondary once again, as they are programmed to. Unfortunately this
went unnoticed as we proceeded to step 5. Due to the level of
concurrency, this led to the following deadlock error on the secondary:
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 9]
  ERROR: deadlock detected
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 10]
  DETAIL: Process 6313 waits for AccessExclusiveLock on relation 20355
    of database 19833; blocked by process 11991.
    Process 11991 waits for AccessShareLock on relation 20083 of
    database 19833; blocked by process 6313.
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 11]
  CONTEXT: SQL statement "lock table "public"."feedback_packed" in
    access exclusive mode"
    PL/pgSQL function "altertableforreplication" line 53 at execute
    statement
    SQL statement "SELECT "_radio".alterTableForReplication( $1 )"
    PL/pgSQL function "ddlscript_complete_int" line 11 at perform
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 12]
  STATEMENT: select "_radio".ddlScript_complete_int(1, -1);
Unfortunately this too went unnoticed at first; what we did notice
was that the app servers were still failed over and we couldn't
connect to the primary. This was quickly tracked down to the problem
with pgpool, which was fixed by restarting it. After that, we
succeeded in switching the app servers back to the primary.
At that point the slon daemons were running, but the secondary daemon
was not getting anywhere. A look at the logs revealed this error:
ERROR remoteWorkerThread_1: "select "_radio".ddlScript_prepare_int(1, -1); "
  PGRES_FATAL_ERROR ERROR: Slony-I: alterTableRestore():
  Table "public"."settlement_log" is not in altered state
INFO <output from /usr/lib/postgresql/bin/slon> CONTEXT: SQL
  statement "SELECT "_radio".alterTableRestore( $1 )"
This resulted in a bit of a sinking feeling in my gut, since I knew
this likely meant that none of the replicated tables were in the
altered state, and thus none of the slony triggers were present to
prevent data from being written to the secondary. A quick look at the
tables revealed this to be the case. Unfortunately, by this time the
app servers were happily writing to the primary, so the mirror was
broken and the slony pair was hosed.
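For anyone wanting to perform the same check: slony 1.2 names its
triggers after the cluster (the denyaccess triggers are what block
writes on a subscriber), so a query along these lines shows whether
the tables are still protected. The trigger-name convention is assumed
from slony 1.2 and the cluster name is taken from the logs above:

    -- An empty result on the subscriber means the _radio_denyaccess_*
    -- protection is gone and the tables are writable
    select c.relname, t.tgname
      from pg_trigger t
      join pg_class c on c.oid = t.tgrelid
     where t.tgname ~ '^_radio_';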
So the end result was that data got written to the secondary, and it
is more-or-less lost because other rows with the same primary key
values were subsequently added to the primary.
This situation points to a lack of atomicity in execute script.
Evidently the removal of the slony triggers and the schema changes
were applied in a separate transaction from the restoration of the
triggers. This creates the dangerous possibility that the former will
succeed but the latter will fail, which is exactly the issue I ran
into here.
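Pieced together from the function names in the logs above, my reading
of the sequence on the subscriber is roughly the following. The
transaction boundaries shown are inferred from the failure mode, not
taken from the 1.2.10 source:

    -- Step 1: restore tables to their unaltered state (dropping the
    -- slony triggers), then apply the DDL from the script
    begin;
    select "_radio".ddlScript_prepare_int(1, -1);
    -- ... DDL from the upgrade script runs here ...
    commit;

    -- Step 2: re-alter the tables, re-creating the triggers; this is
    -- the call that deadlocked, committing nothing and leaving every
    -- replicated table writable
    begin;
    select "_radio".ddlScript_complete_int(1, -1);
    commit;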
Glancing at the changelog for slony 1.2.12, I see there appear to be
changes around transaction handling in execute script. Would those
changes prevent this problem? Note that I plan to upgrade soon anyway,
but obviously such a fix would make upgrading even more compelling.
Thanks.
-Casey