[Slony1-commit] By cbbrowne: Added a section on Locking Issues to help address the

Mon Apr 25 19:27:29 PDT 2005

Log Message:
-----------
Added a section on Locking Issues to help address the problems people
may run into.

Modified Files:
--------------
    slony1-engine/doc/adminguide:
        bestpractices.sgml (r1.1 -> r1.2)
        ddlchanges.sgml (r1.14 -> r1.15)
        filelist.sgml (r1.11 -> r1.12)
        slonik_ref.sgml (r1.22 -> r1.23)
        slony.sgml (r1.18 -> r1.19)

Added Files:
-----------
    slony1-engine/doc/adminguide:
        locking.sgml (r1.1)

-------------- next part --------------
Index: slonik_ref.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/slonik_ref.sgml,v
retrieving revision 1.22
retrieving revision 1.23
diff -Ldoc/adminguide/slonik_ref.sgml -Ldoc/adminguide/slonik_ref.sgml -u -w -r1.22 -r1.23

--- doc/adminguide/slonik_ref.sgml
+++ doc/adminguide/slonik_ref.sgml
@@ -1710,7 +1710,11 @@
      trigger function) before it can wait for every concurrent
      transaction to finish. At the same time it cannot hold an open
      transaction to the same database itself since this would result in
-     blocking itself forever.
+    blocking itself forever.</para>
+
+    <para> Note that this is a <link linkend="locking"> locking
+    operation, </link> which means that it can get stuck behind other
+    database activity.
      
      <variablelist>
       <varlistentry><term><literal> ID = ival </literal></term>
@@ -1815,8 +1819,11 @@
      origin).  You would probably prefer to <command>MOVE SET</command>
      instead of <command>FAILOVER</command>, if at all possible, as
      <command>FAILOVER</command> winds up discarding the old origin
-     node as being corrupted.
+     node as being corrupted.</para>
      
+    <para> Note that this is a <link linkend="locking"> locking
+    operation, </link> which means that it can get stuck behind other
+    database activity.
 
      <variablelist>
       <varlistentry><term><literal> ID = ival </literal></term>
@@ -1977,8 +1984,12 @@
     
     <para> See also the warnings in <xref linkend="ddlchanges">.</para>
 
-    <para> Note that at the start of this event, all tables in the
-    specified set are unlocked via the function
+    <para> Note that this is a <link linkend="locking"> locking
+    operation, </link> which means that it can get stuck behind other
+    database activity.</para>
+     
+    <para> At the start of this event, all tables in the specified set
+    are unlocked via the function
     <function>alterTableRestore(tab_id)</function>.  After the SQL
     script has run, they are returned to <quote>replicating
     state</quote> using
Index: ddlchanges.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/ddlchanges.sgml,v
retrieving revision 1.14
retrieving revision 1.15
diff -Ldoc/adminguide/ddlchanges.sgml -Ldoc/adminguide/ddlchanges.sgml -u -w -r1.14 -r1.15
--- doc/adminguide/ddlchanges.sgml
+++ doc/adminguide/ddlchanges.sgml
@@ -115,8 +115,9 @@
 facility is somewhat fragile and fairly dangerous.  Making DDL changes
 must not be done in a sloppy or cavalier manner.  If your applications
 do not have fairly stable SQL schemas, then using &slony1; for
-replication is likely to be fraught with trouble and
-frustration.</para>
+replication is likely to be fraught with trouble and frustration.  See
+the section on <link linkend="locking"> locking issues </link> for
+more discussion of related issues.</para>
 
 <para>There is an article on how to manage &slony1; schema changes
 here: <ulink url="http://www.varlena.com/varlena/GeneralBits/88.php">
Index: bestpractices.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/bestpractices.sgml,v
retrieving revision 1.1
retrieving revision 1.2
diff -Ldoc/adminguide/bestpractices.sgml -Ldoc/adminguide/bestpractices.sgml -u -w -r1.1 -r1.2
--- doc/adminguide/bestpractices.sgml
+++ doc/adminguide/bestpractices.sgml
@@ -18,8 +18,8 @@
 servers. </para>
 
 <para> There are, however, a number of things that early adopters of
-&slony1; have discovered which can at least help to suggest some
-policies you might want to consider. </para>
+&slony1; have discovered which can at least help to suggest the sorts
+of policies you might want to consider. </para>
 
 <itemizedlist>
 
@@ -134,6 +134,48 @@
 discusses the different sorts of <quote>deletion</quote> that &slony1;
 supports.  </para> </listitem>
 
+<listitem><para> <link linkend="locking"> Locking issues </link>
+</para>
+
+<para> Certain &slony1; operations, notably <link
+linkend="stmtsetaddtable"> <command>set add table</command> </link>,
+<link linkend="stmtmoveset"> <command> move set</command> </link>,
+<link linkend="stmtlockset"> <command> lock set </command> </link>,
+and <link linkend="stmtddlscript"> <command>execute script</command>
+</link> require acquiring <emphasis>exclusive locks</emphasis> on the
+tables being replicated. </para>
+
+<para> Depending on the kind of activity on the databases, this may or
+may not have the effect of requiring a (hopefully brief) database
+outage. </para> </listitem>
+
+<listitem id="slonyuser"><para> &slony1;-specific user names. </para>
+
+<para> It has proven useful to define a <command>slony</command> user
+for use by &slony1;, as distinct from a generic
+<command>postgres</command> or <command>pgsql</command> user.  </para>
+
+<para> If all sorts of automatic <quote>maintenance</quote>
+activities, such as <command>vacuum</command>ing and performing
+backups, are performed under the <quote>ownership</quote> of a single
+&postgres; user, it turns out to be pretty easy to run into deadlock
+problems. </para>
+
+<para> For instance, a series of <command>vacuums</command> that
+unexpectedly run against a database that has a large
+<command>SUBSCRIBE_SET</command> event under way may run into a
+deadlock which would roll back several hours' worth of data copying
+work.</para>
+
+<para> If, instead, different maintenance roles are performed by
+different users, you may, during vital operations such as
+<command>SUBSCRIBE_SET</command>, lock out other users at the
+<filename>pg_hba.conf</filename> level, only allowing the
+<command>slony</command> user in, which substantially reduces the risk
+of problems while the subscription is in progress.
+</para>
+</listitem>
+
 <listitem><para> listen path management </para> </listitem>
 
 <listitem><para> path configuration </para> </listitem>
Index: slony.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/slony.sgml,v
retrieving revision 1.18
retrieving revision 1.19
diff -Ldoc/adminguide/slony.sgml -Ldoc/adminguide/slony.sgml -u -w -r1.18 -r1.19
--- doc/adminguide/slony.sgml
+++ doc/adminguide/slony.sgml
@@ -49,6 +49,7 @@
  &failover;
  &listenpaths;
  &plainpaths;
+ &locking;
  &addthings;
  &dropthings;
  &logshipping;
Index: filelist.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/filelist.sgml,v
retrieving revision 1.11
retrieving revision 1.12
diff -Ldoc/adminguide/filelist.sgml -Ldoc/adminguide/filelist.sgml -u -w -r1.11 -r1.12
--- doc/adminguide/filelist.sgml
+++ doc/adminguide/filelist.sgml
@@ -38,6 +38,7 @@
 <!entity slonybook          SYSTEM "slony.sgml">
 <!entity logshipping        SYSTEM "logshipping.sgml">
 <!entity bestpractices      SYSTEM "bestpractices.sgml">
+<!entity locking            SYSTEM "locking.sgml">
 
 <!-- back matter -->
 <!entity biblio     SYSTEM "biblio.sgml">
--- /dev/null
+++ doc/adminguide/locking.sgml
@@ -0,0 +1,179 @@
+<!-- $Id: locking.sgml,v 1.1 2005/04/25 18:27:25 cbbrowne Exp $ --> 
+<sect1 id="locking"> <title>Locking Issues</title>
+
+<indexterm><primary>locking issues</primary></indexterm>
+
+<para> One of the usual merits of the use, by &postgres;, of
+Multi-Version Concurrency Control (<acronym>MVCC</acronym>) is that
+this eliminates a whole host of reasons to need to lock database
+objects.  On some other database systems, you need to acquire a table
+lock in order to insert data into the table; that can
+<emphasis>severely</emphasis> hinder performance.  On other systems,
+read locks can impede writes; with <acronym>MVCC</acronym>, &postgres;
+eliminates that whole class of locks in that <quote>old reads</quote>
+can access <quote>old tuples.</quote> Most of the time, this allows
+the gentle user of &postgres; to not need to worry very much about
+locks. </para>
+
+<para> Unfortunately, there are several sorts of &slony1; events that
+do require exclusive locks on &postgres; tables, with the result that
+modifying &slony1; configuration can bring back some of those
+<quote>locking irritations.</quote>  In particular:</para>
+
+<itemizedlist>
+
+<listitem><para><link linkend="stmtsetaddtable"> <command>set add
+table</command> </link></para>
+
+<para> A momentary table lock must be acquired on the
+<quote>origin</quote> node in order to add the trigger that collects
+updates for that table.</para>
+</listitem>
+
+<listitem><para><link linkend="stmtmoveset"> <command> move
+set</command> </link></para>
+
+<para> When a set origin is shifted from one node to another, locks
+must be acquired on the tables on both the old origin and the new
+origin in order to change the triggers on the tables.
+</para></listitem>
+
+<listitem><para><link linkend="stmtlockset"> <command> lock set
+</command> </link> </para>
+
+<para> This operation expressly requests locks on the tables in a
+given replication set on the origin node.</para>
+</listitem>
+
+<listitem><para><link linkend="stmtddlscript"> <command>execute
+script</command> </link> </para>
+
+<para> This operation runs a set of SQL queries; in order for it to
+work, the &slony1; triggers must be removed, followed by the query
+(which potentially updates the data) running, followed by triggers
+being restored.  The operation therefore must acquire table locks on
+all replicated tables on each node. </para>
+</listitem>
+
+<listitem><para> During the <command>COPY_SET</command> event on a new
+subscriber </para>
+
+<para> In a sense, this is the least provocative scenario, since,
+before the replication set has been populated, it is pretty reasonable
+to say that the node is <quote>unusable</quote> and that &slony1;
+could reasonably expect exclusive access to the node. </para>
+</listitem>
+
+</itemizedlist>
+
+<para> Each of these actions requires, at some point, modifying each
+of the tables in the affected replication set, which requires
+acquiring an exclusive lock on the table.  Some users that have tried
+running these operations on &slony1; nodes that were actively
+servicing applications have experienced difficulties with deadlocks
+and/or with the operations hanging up. </para>
+
+<para> The obvious question: <quote>What to do about such
+deadlocks?</quote> </para>
+
+<para> Several possibilities present themselves: </para>
+
+<itemizedlist>
+
+<listitem><para> Announce an application outage to avoid deadlocks
+</para>
+
+<para> If you can temporarily block applications from using the
+database, that will provide a window of time during which there is
+nothing running against the database other than administrative
+processes under your control. </para> </listitem>
+
+<listitem><para> Try the operation, hoping for things to work </para> 
+
+<para> Since nothing prevents applications from leaving access locks
+in your way, you may find yourself deadlocked.  But if the number of
+remaining locks is small, you may be able to negotiate with users to
+<quote>get in edgewise.</quote> </para>
+</listitem>
+
+<listitem><para> Use pgpool </para> 
+
+<para> If you can use this or some similar <quote>connection
+broker</quote>, you may be able to tell the connection manager to stop
+using the database for a little while, thereby letting it
+<quote>block</quote> the applications for you.  What would be ideal
+would be for the connection manager to hold up user queries for a
+little while so that the brief database outage looks, to them, like a
+period where things were running slowly.  </para></listitem>
+
+<listitem><para> Rapid Outage Management </para> 
+
+<para> The following procedure may minimize the period of the outage:
+
+<itemizedlist>
+
+<listitem><para> Modify <filename>pg_hba.conf</filename> so that only
+the <link linkend="slonyuser"> <command>slony</command> user </link>
+will have access to the database. </para> </listitem>
+
+<listitem><para> Issue a <command>kill -SIGHUP</command> to the &postgres;  postmaster.</para> 
+
+<para> This will not kill off existing possibly-long-running queries,
+but will prevent new ones from coming in.  There is an application
+impact in that incoming queries will be rejected until the end of the
+process.
+</para>
+</listitem>
+
+<listitem><para> If <quote>all looks good,</quote> then it should be
+safe to proceed with the &slony1; operation. </para> </listitem>
+
+<listitem><para> If some old query is lingering around, you may need
+to <command>kill -SIGQUIT</command> one of the &postgres; processes.
+This will restart the backend and kill off any lingering queries.  You
+probably need to restart the <xref linkend="slon"> processes that
+attach to the node. </para> 
+
+<para> At that point, it will be safe to proceed with the &slony1;
+operation.</para></listitem>
+
+<listitem><para> Reset <filename>pg_hba.conf</filename> to allow other
+users in, and <command>kill -SIGHUP</command> the postmaster to make
+it reload the security configuration. </para> </listitem>
+</itemizedlist>
+
+</para>
+</listitem>
+
+<listitem><para> The section on <link linkend="ddlchanges"> DDL
+Changes </link> suggests some additional techniques that may be
+useful, such as moving tables between replication sets in such a way
+that you minimize the set of tables that need to be
+locked. </para></listitem>
+
+</itemizedlist>
+
+<para> Regrettably, there is no perfect answer to this.  If it is
+<emphasis>necessary</emphasis> to submit a <xref
+linkend="stmtmoveset"> request, then it is presumably
+<emphasis>necessary</emphasis> to accept the brief application outage.
+As &slony1;/<xref linkend="pgpool"> linkages improve, that may become
+a better way to handle this.</para>
+
+</sect1>
+<!-- Keep this comment at the end of the file
+Local variables:
+mode:sgml
+sgml-omittag:nil
+sgml-shorttag:t
+sgml-minimize-attributes:nil
+sgml-always-quote-attributes:t
+sgml-indent-step:1
+sgml-indent-data:t
+sgml-parent-document:"book.sgml"
+sgml-exposed-tags:nil
+sgml-local-catalogs:("/usr/lib/sgml/catalog")
+sgml-local-ecat-files:nil
+End:
+-->
+
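
The "Rapid Outage Management" steps in locking.sgml above (restrict
pg_hba.conf to the slony user, then SIGHUP the postmaster) could be
scripted roughly as follows. This is only a sketch, not part of the
commit: the function names are hypothetical, the rule parsing assumes
plain whitespace-separated pg_hba.conf entries with the user in the
third field (no quoted fields, no include files), and it relies on the
postmaster writing its PID as the first line of postmaster.pid in the
data directory.

```python
import os
import signal
from pathlib import Path


def restrict_hba(hba_text: str, allowed_user: str = "slony") -> str:
    """Return pg_hba.conf text whose active rules admit only allowed_user.

    Each active rule is commented out and re-added with its user field
    (third column) replaced, so connection type, database, address and
    auth method are carried over unchanged. Assumes simple,
    whitespace-separated rules.
    """
    out = []
    seen = set()
    for line in hba_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)  # keep comments and blank lines as they are
            continue
        out.append("# disabled for maintenance: " + stripped)
        fields = stripped.split()
        if len(fields) < 4:
            continue  # not a rule shape this sketch understands
        fields[2] = allowed_user
        rule = " ".join(fields)
        if rule not in seen:  # avoid emitting duplicate slony-only rules
            seen.add(rule)
            out.append(rule)
    return "\n".join(out) + "\n"


def postmaster_pid(data_dir: str) -> int:
    """Read the postmaster PID from the first line of postmaster.pid."""
    pid_file = Path(data_dir) / "postmaster.pid"
    return int(pid_file.read_text().splitlines()[0])


def reload_postmaster(data_dir: str) -> None:
    """Send SIGHUP so the postmaster rereads pg_hba.conf."""
    os.kill(postmaster_pid(data_dir), signal.SIGHUP)
```

One would keep a copy of the original pg_hba.conf, write out the
restricted version, call reload_postmaster(), run the Slony-I
operation, then restore the original file and reload again. As the
text notes, the reload does not end sessions that are already
connected.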