Jan Wieck JanWieck at Yahoo.com
Thu Sep 27 14:26:31 PDT 2012
On 9/27/2012 3:58 PM, Brian Fehrle wrote:
> Follow up:
>
> I executed this on the master:
> mydatabase=# select * from _slony.sl_event where ev_origin not in
> (select no_id from _slony.sl_node);
>  ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
> -----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
>          3 | 5000290161 | 2012-09-27 09:48:03.749424-04 | 40580084:40580084: | SYNC    |          |          |          |          |          |          |          |
> (1 row)
>
> There is a row in sl_event that shouldn't be there, because it's
> referencing a node that no longer exists. I need to add this node back to
> replication, but I don't want to run into the same issue as before. I
> ran a cleanupEvent('10 minute') and it did nothing (I even tried it with 0
> minutes).
>
> Will this row eventually go away? Will it cause an issue if we attempt to
> add a new node to replication with node = 3? How can I safely clean this up?

Hmmm,

this actually looks like a more severe race condition or even a bug.

The thing is that processing the DROP NODE and replicating the SYNC are 
handled by different worker threads, since the events originate on 
different nodes. Cleaning out sl_event is part of dropNode_int(), but 
the remoteWorker for node 3 may have inserted that SYNC concurrently, 
so it was left behind.

My guess is that the right solution is to clean out everything again 
when a STORE NODE comes along. We had been thinking of making node IDs 
non-reusable to prevent this sort of race condition.
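In the meantime, the stray row can presumably be removed by hand with the 
same kind of cleanup dropNode_int() performs. A sketch only, assuming the 
slon daemons for the cluster are stopped first and that sl_confirm may 
hold matching leftovers for the dropped node (run on each node, adjusting 
the _slony schema name to your cluster):

    -- Assumption: no node with these IDs still exists in sl_node,
    -- and no slon process is running against this database.
    BEGIN;

    -- Remove events originating from nodes that are no longer registered.
    DELETE FROM _slony.sl_event
     WHERE ev_origin NOT IN (SELECT no_id FROM _slony.sl_node);

    -- Remove confirmations referencing those nodes on either side.
    DELETE FROM _slony.sl_confirm
     WHERE con_origin   NOT IN (SELECT no_id FROM _slony.sl_node)
        OR con_received NOT IN (SELECT no_id FROM _slony.sl_node);

    COMMIT;

Verify with the same SELECT shown above that the row is gone before 
restarting the slons and issuing the STORE NODE.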


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
