[Slony1-hackers] Cleaning the sl_confirm table

Tue Feb 23 00:55:51 PST 2010

I'm sorry for the spam, but it seems you ML manager does'nt like lines
starting with "From ", it cuts the mail body just before it.

Here(s my full mail this time:

Hi,

I have a cool Slony cluster with 12 PGs running for years.
But I noticed some strange data in the origin's sl_confirm table:

SELECT *
FROM _ob2replication.sl_confirm
WHERE con_seqno NOT IN (SELECT ev_seqno FROM _ob2replication.sl_event);

 con_origin | con_received | con_seqno |       con_timestamp
------------+--------------+-----------+----------------------------
        21 |           26 |   3277845 | 2009-09-26 15:01:12.61598
        21 |           27 |   3277845 | 2009-10-22 15:13:26.370957
        21 |           25 |   3277845 | 2009-09-26 14:54:16.162632
        21 |            5 |   3277845 | 2009-10-22 15:13:26.669225

con_timestamp should be some date near today (february 2010)

Or from another server:

 con_origin | con_received | con_seqno |       con_timestamp
------------+--------------+-----------+----------------------------
        21 |           26 |   3277845 | 2009-09-26 15:01:12.61598
        21 |           27 |   3277845 | 2009-09-26 15:01:12.549303
        21 |           22 |   3277845 | 2009-09-26 15:01:12.627592
        21 |            5 |   3277845 | 2009-09-26 15:01:12.831765
        21 |           25 |   3277845 | 2009-09-26 14:54:16.162632

And from another:

 con_origin | con_received | con_seqno |       con_timestamp
------------+--------------+-----------+----------------------------
        21 |           26 |   3277845 | 2009-09-26 15:01:12.61598
        21 |            5 |   3277845 | 2009-09-26 15:01:12.831765
        21 |           27 |   3277845 | 2009-09-26 15:01:12.549303
        21 |           25 |   3277845 | 2009-09-26 14:54:16.162632
        21 |           22 |   3277845 | 2009-09-26 15:01:12.627592
        17 |           25 |         0 | 2010-01-26 11:18:22.45564
        17 |           26 |         0 | 2010-01-26 11:18:22.455672
        17 |           27 |         0 | 2010-01-26 11:18:22.455704
        17 |           22 |         0 | 2010-01-26 11:18:22.455729
        17 |            5 |         0 | 2010-01-26 11:18:22.455753
        17 |           21 |         0 | 2010-01-26 11:18:22.455794
        17 |           12 |         0 | 2010-01-26 11:18:22.455835
        17 |           16 |         0 | 2010-01-26 11:18:22.455861
        17 |            2 |         0 | 2010-01-26 11:18:22.455898
        17 |           24 |         0 | 2010-01-26 11:18:22.455926

On every server I have entries in the sl_confirm tables with events
references which do not exist anymore.

The cluster works, but I don't like thoses entries staying there. What
can I do ? Delete them ?

Those entries should be cleaned up but the cleanupEvent() plpgsql
function, but they are not.

I launched a debugged version of the cleanupEvent() function and here
what I got:
SELECT _ob2replication.mycleanup();

NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND con_received=2
AND con_seqno < 1060869
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND con_received=5
AND con_seqno < 1060869
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=12 AND con_seqno < 1060869
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=16 AND con_seqno < 1060869
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=17 AND con_seqno < 1060869
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=22 AND con_seqno < 3277845
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=23 AND con_seqno < 1060928
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=24 AND con_seqno < 1060869
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=25 AND con_seqno < 3277845
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=26 AND con_seqno < 3277845
NOTICE:  DELETE FROM sl_confirm WHERE con_origin=21 AND
con_received=27 AND con_seqno < 3277845

F*r*o*m origin 21, the legitimate seqno is ~1060869. But the bad entries
in sl_confirm with a seqno ~3277845 are preventing cleanupEvent() to
clean those.

And if I take a look at the seqno range on the master I get:

SELECT con_origin,
 MIN(con_seqno) AS first_con_seqno,
 MAX(con_seqno) AS last_con_seqno,
 MAX(con_seqno) - MIN(con_seqno) AS delta
FROM _ob2replication.sl_confirm
GROUP BY con_origin
ORDER BY con_origin ASC;

 con_origin | first_con_seqno | last_con_seqno |  delta
------------+-----------------+----------------+---------
         2 |         1850804 |        1850903 |      99
         5 |         1437305 |        1437403 |      98
        12 |          992789 |         992888 |      99
        16 |          957046 |         957145 |      99
        17 |          230040 |         230139 |      99
        21 |         1060965 |        3277845 | 2216880
        22 |         4790482 |        4790906 |     424
        23 |         1050820 |        1050919 |      99
        24 |           99661 |          99694 |      33
        25 |         1858636 |        1858734 |      98
        26 |         2629674 |        2629773 |      99
        27 |         1788901 |        1789000 |      99
(12 rows)

As you can see, the node 21 is keeping a huge range of confirmations
while the usual is to keep less than a thousand.

My supposition: Last year in september, there was a replication
maintenance, it is possible we might have to reinstall slony on the
node 21. If this was the case, the seqno was reset. But not every
event on the cluster was cleaned, and some confirm lines in sl_confirm
stayed there.

As every one of those confirm lines in sl_confirm in the cluster refer
to unexistent sl_event, can I delete those sl_confirm lines without fearing
any replication problem ?

Thanks again for the help =)
-- 
Laurent Raufaste
<http://www.glop.org/>