Jeff threshar at torgo.978.org
Fri Oct 23 10:41:08 PDT 2009
On Oct 23, 2009, at 10:57 AM, Jeff wrote:

> Just ran into this problem - the origin is 8.2, replica is 8.4.1
>
> 2009-10-23 10:47:42 EDT DEBUG1 copy_set 3
> 2009-10-23 10:47:42 EDT DEBUG1 remoteWorkerThread_1: connected to  
> provider DB
> 2009-10-23 10:47:42 EDT WARN   remoteWorkerThread_1: transactions  
> earlier than XID 3820504992 are still in progress
> 2009-10-23 10:47:42 EDT WARN   remoteWorkerThread_1: data copy for  
> set 3 failed - sleep 60 seconds
> NOTICE:  there is no transaction in progress
> 2009-10-23 10:48:42 EDT DEBUG1 copy_set 3
> 2009-10-23 10:48:42 EDT DEBUG1 remoteWorkerThread_1: connected to  
> provider DB
> 2009-10-23 10:48:42 EDT ERROR  remoteWorkerThread_1: Could not lock  
> table "public"."companyinfo" on subscriber
> 2009-10-23 10:48:42 EDT WARN   remoteWorkerThread_1: data copy for  
> set 3 failed - sleep 60 seconds
> NOTICE:  there is no transaction in progress
>
> In the PG log
> LOG:  checkpoint starting: time
> ERROR:  LOCK TABLE can only be used in transaction blocks
> STATEMENT:  lock table "public"."companyinfo";

So I've dug into this and attached a patch to solve it.

In a nutshell: in the event loop (this is in remote_worker.c) we start
a transaction, then, if the event is not an accept set event, we lock
the config lock table.  We then zero out query1.

The ENABLE_SUBSCRIPTION event runs in a while(true) loop.
First it executes query1 (which, thanks to the above, is empty), then
tries to copy_set.  If copy_set fails for whatever reason we ROLLBACK
our local conn (query2) and then loop.

The problem with this is that when we come back around in the next loop
we're outside of a transaction, and one won't be started because query1
has been zeroed out.  This causes LOCK TABLE to barf on PG8.4.  You are
forever stuck until you restart slon.  This also explains another
problem I've seen a couple times.
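
To make the failure mode concrete, here is a minimal sketch: a
self-contained C model, not the real slon code. MockConn, mock_exec,
mock_copy_set and run_unpatched_loop are hypothetical names, and the SQL
strings are only illustrative. The only server behaviour modelled is
that LOCK TABLE is rejected outside a transaction block and that
ROLLBACK ends the transaction.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the subscriber connection (not a libpq type). */
typedef struct {
    bool in_txn;
} MockConn;

/* Returns true if the statement "succeeds" in this toy model. */
static bool mock_exec(MockConn *conn, const char *sql)
{
    if (sql[0] == '\0')
        return true;                      /* empty query1: nothing is sent */
    if (strncmp(sql, "start transaction", 17) == 0) {
        conn->in_txn = true;
        return true;
    }
    if (strncmp(sql, "rollback", 8) == 0) {
        conn->in_txn = false;
        return true;
    }
    if (strncmp(sql, "lock table", 10) == 0)
        return conn->in_txn;              /* fails outside a transaction block */
    return true;
}

/* Stand-in for copy_set(): attempt 0 fails before taking any lock (as with
 * the "transactions earlier than XID ..." wait); later attempts reach
 * LOCK TABLE. */
static bool mock_copy_set(MockConn *conn, int attempt)
{
    if (attempt == 0)
        return false;
    return mock_exec(conn, "lock table \"public\".\"companyinfo\";");
}

/* Returns the number of attempts used, or -1 if every attempt failed. */
int run_unpatched_loop(int max_attempts)
{
    MockConn conn = { false };
    char query1[64] = "start transaction;";

    mock_exec(&conn, query1);             /* event loop opens the transaction */
    query1[0] = '\0';                     /* ... and then zeroes out query1 */

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        mock_exec(&conn, query1);         /* empty, so no new transaction */
        if (mock_copy_set(&conn, attempt))
            return attempt + 1;
        mock_exec(&conn, "rollback transaction;");   /* query2 */
    }
    return -1;
}
```

In this model run_unpatched_loop(5) returns -1: after the first rollback,
every later attempt reaches LOCK TABLE with no transaction open.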

We subscribe to a set with, say, 3 tables.
The initial subscription fails due to an earlier txn wait.
We copy the first table of the set successfully.
Then the second table fails to copy due to some DDL issue (perhaps for
some reason a PK or column is missing).  We issue a rollback but since
we are not in a txn, nothing happens. The event does not succeed so we
try again.
What happens next is that, since our previous work wasn't rolled back,
slony sees we've already got the deny trigger & friends on the first
table and barfs.   Cue infinite loop fixed only by shutting down slon
and playing with the sl_ tables.
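
The sequence above can be replayed with a small C model. This is a
sketch under assumptions, not slon code: MockDb, the db_* helpers,
copy_set_attempt and replay_subscription are all hypothetical names, and
the model assumes a server that lets the per-table setup run outside a
transaction block, where each statement autocommits. The restart_txn
flag is only there to contrast with a retry that runs inside a fresh
transaction.

```c
#include <stdbool.h>
#include <string.h>

#define NTABLES 3

/* Hypothetical model of the subscriber: durable state plus open-txn state. */
typedef struct {
    bool in_txn;
    bool committed[NTABLES];  /* table already has its triggers, durably */
    bool pending[NTABLES];    /* triggers added inside the open transaction */
} MockDb;

static void db_begin(MockDb *db)
{
    db->in_txn = true;
}

/* ROLLBACK discards pending work. Outside a transaction there is nothing
 * pending, so it changes nothing (the server just emits
 * "NOTICE: there is no transaction in progress"). */
static void db_rollback(MockDb *db)
{
    memset(db->pending, 0, sizeof db->pending);
    db->in_txn = false;
}

/* Fails if the table already carries the triggers. Outside a transaction
 * the change is autocommitted immediately. */
static bool db_setup_table(MockDb *db, int t)
{
    if (db->committed[t] || db->pending[t])
        return false;
    if (db->in_txn)
        db->pending[t] = true;
    else
        db->committed[t] = true;
    return true;
}

/* One copy attempt. Returns the index of the table that failed, or NTABLES
 * on success. broken_table is the table with the DDL problem. */
static int copy_set_attempt(MockDb *db, int broken_table)
{
    for (int t = 0; t < NTABLES; t++) {
        if (!db_setup_table(db, t) || t == broken_table)
            return t;
    }
    return NTABLES;
}

/* Replays the three attempts; returns the table at which the last one fails. */
int replay_subscription(bool restart_txn)
{
    MockDb db = { 0 };

    db_begin(&db);                      /* opened once by the event loop */
    db_rollback(&db);                   /* attempt 1: earlier-txn wait */

    if (restart_txn)
        db_begin(&db);
    (void) copy_set_attempt(&db, 1);    /* attempt 2: second table is broken */
    db_rollback(&db);

    if (restart_txn)
        db_begin(&db);
    return copy_set_attempt(&db, 1);    /* attempt 3 */
}
```

Here replay_subscription(false) returns 0: the retry trips over the
first table, whose triggers were never rolled back. With
replay_subscription(true) the retry gets back to the table that is
actually broken and returns 1.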

This patch keeps a count of how many retries we've had on this
copy_set.  If we are on retry > 0 then we re-issue a start
transaction, set isolation, and lock the config table. My testing has
shown that this works.
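
Roughly, the retry handling has the shape sketched below. This is not
the attached patch, just a self-contained C model of the idea. MockConn,
mock_exec, mock_copy_set and run_patched_loop are hypothetical names,
and the SQL text (including the "_cluster" schema name and the isolation
level) is illustrative rather than copied from remote_worker.c.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the subscriber connection (not a libpq type). */
typedef struct {
    bool in_txn;
} MockConn;

/* Toy server: LOCK TABLE only works inside a transaction block. */
static bool mock_exec(MockConn *conn, const char *sql)
{
    if (strncmp(sql, "start transaction", 17) == 0) {
        conn->in_txn = true;
        return true;
    }
    if (strncmp(sql, "rollback", 8) == 0) {
        conn->in_txn = false;
        return true;
    }
    if (strncmp(sql, "lock table", 10) == 0)
        return conn->in_txn;
    return true;                          /* e.g. "set transaction ..." */
}

/* Stand-in for copy_set(): attempt 0 fails before taking any lock; later
 * attempts succeed only if LOCK TABLE is accepted. */
static bool mock_copy_set(MockConn *conn, int attempt)
{
    if (attempt == 0)
        return false;
    return mock_exec(conn, "lock table \"public\".\"companyinfo\";");
}

/* Returns the number of attempts used, or -1 if every attempt failed. */
int run_patched_loop(int max_attempts)
{
    MockConn conn = { false };

    mock_exec(&conn, "start transaction;");   /* opened once by the event loop */

    for (int retries = 0; retries < max_attempts; retries++) {
        if (retries > 0) {
            /* The previous rollback closed the transaction, so rebuild the
             * state the event loop had set up before the first attempt. */
            if (!mock_exec(&conn, "start transaction;") ||
                !mock_exec(&conn, "set transaction isolation level serializable;") ||
                !mock_exec(&conn, "lock table \"_cluster\".sl_config_lock;"))
                return -1;
        }
        if (mock_copy_set(&conn, retries))
            return retries + 1;
        mock_exec(&conn, "rollback transaction;");
    }
    return -1;
}
```

In this model run_patched_loop(5) returns 2: the first attempt fails and
rolls back, the retry reopens the transaction, and LOCK TABLE is then
accepted.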

-------------- next part --------------
A non-text attachment was scrubbed...
Name: copy_set_retry.patch
Type: application/octet-stream
Size: 2598 bytes
Desc: not available
Url : http://lists.slony.info/pipermail/slony1-general/attachments/20091023/36a8cafe/copy_set_retry.obj
-------------- next part --------------



--
Jeff Trout <jeff at jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




