Brian Fehrle brianf at consistentstate.com
Thu Sep 27 11:34:16 PDT 2012
Hi all,

PostgreSQL v 9.1.5 - 9.1.6
Slony version 2.1.0

I'm having an issue that's occurred twice now. I have a 4-node Slony 
cluster, and one of the operations is to drop a node from replication, 
do maintenance on it, then add it back to replication.

Node 1 = master
Node 2 = slave
Node 3 = slave  -> dropped then readded
Node 4 = slave

I dropped node 3 from replication, then a while later added it back. I 
have my own set of scripts that add it to replication all automatically. 
They have been used dozens of times over the past few months without 
issue. But recently, I've had the issue where after adding back the node 
(in this example, node 3), node 4 suddenly stops syncing. I've attached 
a snip of the log for node 4 in this email. The main thing I'm seeing is 
this:

Node 4 (the one that stopped syncing) was processing the following SYNC 
events:
5000030355
5000030357
5000030358
5000030359

Then at this point it receives the event recognizing the new node 
addition to the slony cluster.

Then we hit the error where it attempts to insert event 5000000032 
(see the attached log).

Are these sequential event IDs supposed to be related?

I looked at the sl_event on the node 4 box that's erroring, and did the 
following query:

select now() - ev_timestamp, ev_seqno from _slony.sl_event where 
ev_origin = 4 order by 1;
(the first column shows how long ago each event occurred)
     ?column?     |  ev_seqno
-----------------+------------
  01:48:30.7277   | 5000000031
  01:48:40.731095 | 5000000030
  01:48:50.734454 | 5000000029
  01:49:00.735438 | 5000000028
  01:49:10.728117 | 5000000027
  01:49:20.731387 | 5000000026
  01:49:30.734199 | 5000000025
  01:49:40.726768 | 5000000024
  01:49:50.730152 | 5000000023
  01:50:00.732747 | 5000000022
  01:50:10.736126 | 5000000021
  01:50:20.737394 | 5000000020
  01:50:30.733839 | 5000000019
  01:50:40.736928 | 5000000018
  01:50:50.739931 | 5000000017
  01:51:00.742365 | 5000000016
  01:51:10.745724 | 5000000015
  01:51:20.748913 | 5000000014
  01:51:30.751388 | 5000000013
  01:51:40.754439 | 5000000012
  01:51:50.757796 | 5000000011
  01:52:00.760079 | 5000000010
  01:52:10.763595 | 5000000009
  01:52:20.762106 | 5000000008
  01:52:30.764838 | 5000000007
  01:52:40.768154 | 5000000006
  01:52:50.771847 | 5000000005
  01:53:00.775925 | 5000000004
  01:53:38.2839   | 5000000003
  01:53:38.422047 | 5000000002
  01:53:38.435689 | 5000000001
  01:55:29.173397 | 5000000032

NOTE: At the moment I took this snapshot, the st_lag_time for this node 
was '01:55:28.411043', which falls between the last two timestamps above.

What strikes me as 'interesting' is that 5000000032 (the one it's 
erroring on) has a timestamp far older than all the rest. So I'm 
thinking that perhaps it was left over from the previous incarnation of 
the node, before I dropped it, and was never cleaned up?
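As a rough sketch of how one might spot such a leftover row (assuming 
the cluster schema is _slony; untested, so treat it as illustrative 
only), this looks for events whose sequence number is higher than a 
sibling's but whose timestamp is older, which is the inversion I'm 
seeing with 5000000032:

    -- Hypothetical check: flag sl_event rows for an origin whose seqno
    -- is higher than another row's while the timestamp is older.
    SELECT e1.ev_origin, e1.ev_seqno, e1.ev_timestamp
    FROM _slony.sl_event e1
    WHERE e1.ev_origin = 4
      AND EXISTS (
        SELECT 1 FROM _slony.sl_event e2
        WHERE e2.ev_origin = e1.ev_origin
          AND e2.ev_seqno < e1.ev_seqno
          AND e2.ev_timestamp > e1.ev_timestamp
      )
    ORDER BY e1.ev_seqno;

On my snapshot above this would return only 5000000032.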

The last time this happened, I deleted the offending row myself 
(desperate, I know; I usually never touch the Slony schema data), and 
that completely broke replication. It's a smaller cluster, so it wasn't 
too hard to set up again from scratch.

So I'm going to do some testing to see if it's due to lack of cleanup, 
but thought I'd throw this out there. Is there anything else that could 
possibly cause this?
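In case it helps anyone reproduce this, one could also compare what each 
node has confirmed from each origin (again assuming the _slony schema 
name; a sketch, untested):

    -- Highest event each receiving node has confirmed from each origin;
    -- a stale or missing confirmation for the re-added node might show here.
    SELECT con_origin, con_received, max(con_seqno) AS last_confirmed
    FROM _slony.sl_confirm
    GROUP BY con_origin, con_received
    ORDER BY con_origin, con_received;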

thanks in advance,
- Brian F

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: temp.db8.log
Url: http://lists.slony.info/pipermail/slony1-general/attachments/20120927/e4ac6750/attachment.txt 

