[Slony1-general] Re: Data loss in cleanupEvent()

Tue Jul 21 16:43:52 PDT 2009

Hi,

I did some additional testing and I now have a way of reproducing the data loss. It looks like the actual problem is in 
logswitch_finish(), but telling cleanupEvent() to leave data around for a longer period prevents it from happening.

Follow these steps:

1. Open 2 psql terminals and connect to the master database with both.

2. Make sure there is no log switch in progress. When I did this, both sl_log tables were empty.

3. In terminal 1 run:

BEGIN;
INSERT INTO sometable VALUES (...);

4. In terminal 2 run:

SELECT _clustername.logswitch_start();
SELECT _clustername.logswitch_finish();
-- (logswitch_finish() will just sit there waiting)

5. In terminal 1 run:

COMMIT;
-- (This will also cause logswitch_finish() in terminal 2 to complete)

6. Check both sl_log tables - they will be empty.
7. Check the table on slave node - the new row won't be there.

On my test setup I get data loss every time.

-----------------------------------------------------------------------
WARNING: I'm just blindly guessing here. Do things even work that way?

What I think may be happening:
- my transaction starts
- logswitch_finish() is called
- there are no visible old rows around which logswitch_finish() could detect and use to determine that it should not 
truncate sl_log. My transaction is generating new rows at this time, but they are not visible to logswitch_finish().
- logswitch_finish() executes a TRUNCATE statement. This statement just sits there waiting for a lock on sl_log.
- my transaction commits
- TRUNCATE gets its lock on sl_log and immediately destroys all the rows generated by my transaction.

I had cleanup_interval set to 1 minute during testing but it didn't seem to affect the results - transactions lasting 
only 20 seconds were also lost. The reason why setting cleanup_interval to 6 hours made this problem go away on our 
production cluster could be that this made some statements from up to 6 hours ago visible to logswitch_finish() and it 
knew it shouldn't truncate the log tables.
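
To make the guess above concrete, here is a toy model of it in Python. It is not Slony-I or PostgreSQL code. ToyDb, 
simulate() and the rule "truncate only if no rows are visible at check time" are all made up for illustration and 
simply encode my assumption about what logswitch_finish() does. It only shows that, if things do work that way, the 
ordering from the steps above loses the row, while a leftover visible row would prevent it.

```python
# Toy model of the suspected race. Hypothetical names throughout;
# this does not reproduce real Slony-I or PostgreSQL internals.


class ToyDb:
    def __init__(self):
        self.log_rows = []      # stands in for sl_log: (txid, payload) pairs
        self.committed = set()  # txids whose rows are visible to other sessions
        self.next_txid = 1

    def begin(self):
        txid = self.next_txid
        self.next_txid += 1
        return txid

    def insert(self, txid, payload):
        self.log_rows.append((txid, payload))

    def commit(self, txid):
        self.committed.add(txid)

    def visible_rows(self):
        # Rows written by uncommitted transactions are invisible to others.
        return [row for row in self.log_rows if row[0] in self.committed]

    def truncate(self):
        # Discards every row in the table, visible or not.
        self.log_rows.clear()


def simulate(old_visible_rows):
    """Replay the suspected ordering; return the payloads that survive."""
    db = ToyDb()
    if old_visible_rows:
        # Assumed effect of a long cleanup interval: old rows still present.
        t0 = db.begin()
        db.insert(t0, "old row")
        db.commit(t0)

    # Terminal 1: BEGIN; INSERT ... (not committed yet)
    t1 = db.begin()
    db.insert(t1, "new row")

    # Terminal 2: the (assumed) emptiness check happens now, then the
    # truncate blocks until terminal 1 releases its lock.
    will_truncate = len(db.visible_rows()) == 0

    # Terminal 1: COMMIT
    db.commit(t1)

    # Terminal 2: the blocked truncate finally runs.
    if will_truncate:
        db.truncate()

    return [payload for _, payload in db.visible_rows()]
```

In this model simulate(False) returns an empty list (the committed row is gone), and simulate(True) keeps both the 
old and the new row, which would match what I saw with the longer cleanup interval.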

/End blind guessing.
-----------------------------------------------------------------------

Does any of this make sense?

Regards,
Aleksander