[Slony1-general] Replication fails on Corrupted PK index

Sun Apr 27 10:06:54 PDT 2014

On 04/25/14 07:32, Raghav wrote:
> Hi,
>
> I am facing an issue, where one of primary key index got corrupted, due
> to which duplicate rows got inserted into the table. Same duplicate rows
> are captured by slony in its sl_log_* and failing to replay those events
> on slave with duplicate key violation and aborting. Later I found a
> Slony Faq where it describes the similar situation(I am guessing) under
> "*8.4. *Replication Fails - Unique Constraint Violation" section in this
> link. http://main.slony.info/documentation/1.2/faq.html#AEN7936

Let me make sure that I understand the issue correctly. You say that 
your master database has a corrupted index, which allowed duplicate keys 
to be inserted. This means that without removing those duplicate rows, 
you could not REINDEX the table on the master either. Replication fails 
because you do not have the same index corruption on the replica(s). 
Correct me if I misunderstood.

The following is based on this assumption.

>
> I could able to fix it by removing entries manually from the sl_log_* to
> continue syncing.
> My curious question is, why slony sl_log_* are not capturing distinct
> values. I believe if it captures the distinct values in sl_log_* then
> there's no point of sync abort. Correct me if am wrong.
> Also, the link describes the situation of Slony 1.2 version whereas am
> using latest version of slony. Is this still expected in this version of
> slony ?

Slony log tables are capturing INSERT, UPDATE, DELETE and TRUNCATE 
operations. Except for the TRUNCATE, all those operations are captured 
on the row level. These insert operations that led to duplicate keys did 
happen, even though they should not have. But Slony has A) no way of 
detecting that and B) it isn't Slony's duty to second guess the 
integrity of the master database.

What you might have here is that your replica contains all your data up 
to a second before the index got corrupted. I say "might" because this 
sort of index corruption could have had other side effects, like not 
finding rows on UPDATE or DELETE, which therefore contain old data or 
should not be there at all. Unless your application checks the query 
execution results for the number of rows affected, those UPDATE/DELETE 
failures would not have been detected and since no rows were affected, 
no such operations would have been captured and replicated either.

If you indeed find that your replica contains "better" data than the 
corrupted master, then you could correct the situation easier by 
performing a FAILOVER, or by uninstalling Slony and rebuilding the 
cluster based on the healthy replica as initial master.

Regards,
Jan

>
> Details of issue for your reference:
> *_Version:_*
>
> PG 9.3, RHEL, Slony 2.2
>
> *_On Primary_*
>
> postgres=# \d dtest
>          Table "public.dtest"
>   Column |     Type      | Modifiers
> --------+---------------+-----------
>   id     | integer       | not null
>   name   | character(10) |
> Indexes:
>      "dtest_pkey" PRIMARY KEY, btree (id)
> Triggers:
>      _rf_logtrigger AFTER INSERT OR DELETE OR UPDATE ON dtest FOR EACH
> ROW EXECUTE PROCEDURE _rf.logtrigger('_rf', '2', 'k')
>      _rf_truncatetrigger BEFORE TRUNCATE ON dtest FOR EACH STATEMENT
> EXECUTE PROCEDURE _rf.log_truncate('2')
> Disabled triggers:
>      _rf_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON dtest FOR EACH
> ROW EXECUTE PROCEDURE _rf.denyaccess('_rf')
>      _rf_truncatedeny BEFORE TRUNCATE ON dtest FOR EACH STATEMENT
> EXECUTE PROCEDURE _rf.deny_truncate()
>
> postgres=# select * from dtest ;     /// Due to corruption you have
> duplicate rows.
>   id |    name
> ----+------------
>    1 | A
>    2 | B
>    3 | C
>    4 | D
>    4 | D
>
> postgres=# select * from _rf.sl_log_2;  /// This has captured the event
> as it is
>   log_origin | log_txid | log_tableid | log_actionseq | log_tablenspname
> | log_tablerelname | log_cmdtype | log_cmdupdncols |       log_cmdargs
> ------------+----------+-------------+---------------+------------------+------------------+-------------+-----------------+------------------
>            1 |    12292 |           2 |             9 | public
> | dtest            | T           |               0 | {}
>            1 |    12377 |           2 |            10 | public
> | dtest            | I           |               0 | {id,1,name,"A
>            1 |    12389 |           2 |            11 | public
> | dtest            | I           |               0 | {id,2,name,"B
>            1 |    12400 |           2 |            12 | public
> | dtest            | I           |               0 | {id,3,name,"C
>            1 |    13605 |           2 |            13 | public
> | dtest            | I           |               0 | {id,4,name,"D
>            1 |    13611 |           2 |            14 | public
> | dtest            | I           |               0 | {id,4,name,"D
> (6 rows)
>
>
> *_Slony Log information:_*
>
> 2014-04-11 11:20:09 PDT ERROR  remoteWorkerThread_1_1: error at end of
> COPY IN: ERROR:  duplicate key value violates unique constraint "dtest_pkey"
> DETAIL:  Key (id)=(4) already exists.
> CONTEXT:  SQL statement "INSERT INTO "public"."dtest" ("id", "name")
> VALUES ($1, $2);"
> COPY sl_log_2, line 1: "1       13611   2       14      public  dtest
> I       0       {id,4,name,"D         "}"
> 2014-04-11 11:20:09 PDT ERROR  remoteWorkerThread_1_1: failed SYNC's log
> selection query was 'COPY ( select log_origin, log_txid, NULL::integer,
> log_actionseq, NULL::text, NULL::text, log_cmdtype, NULL::integer,
> log_cmdargs from "_rf".sl_log_script where log_origin = 1 and log_txid
>  >= "pg_catalog".txid_snapshot_xmax('13607:13607:') and log_txid <
> '13614' and "pg_catalog".txid_visible_in_snapshot(log_txid,
> '13614:13614:') union all select log_origin, log_txid, NULL::integer,
> log_actionseq, NULL::text, NULL::text, log_cmdtype, NULL::integer,
> log_cmdargs from "_rf".sl_log_script where log_origin = 1 and log_txid
> in (select * from "pg_catalog".txid_snapshot_xip('13607:13607:') except
> select * from "pg_catalog".txid_snapshot_xip('13614:13614:') ) union all
> select log_origin, log_txid, log_tableid, log_actionseq,
> log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,
> log_cmdargs from "_rf".sl_log_2 where log_origin = 1 and log_tableid in
> (1) and log_txid >= '13607' and log_txid < '13614' and
> "pg_catalog".txid_visible_in_snapshot(log_txid, '13614:13614:') union
> all select log_origin, log_txid, log_tableid, log_actionseq,
> log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,
> log_cmdargs from "_rf".sl_log_2 where log_origin = 1 and log_tableid in
> (1) and log_txid in (select * from
> "pg_catalog".txid_snapshot_xip('13607:13607:') except select * from
> "pg_catalog".txid_snapshot_xip('13614:13614:') ) union all select
> log_origin, log_txid, log_tableid, log_actionseq, log_tablenspname,
> log_tablerelname, log_cmdtype, log_cmdupdncols, log_cmdargs from
> "_rf".sl_log_2 where log_origin = 1 and log_tableid in (2) and log_txid
>  >= '13607' and log_txid < '13614' and
> "pg_catalog".txid_visible_in_snapshot(log_txid, '13614:13614:') union
> all select log_origin, log_txid, log_tableid, log_actionseq,
> log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,
> log_cmdargs from "_rf".sl_log_2 where log_origin = 1 and log_tableid in
> (2) and log_txid in (select * from
> "pg_catalog".txid_snapshot_xip('13607:13607:') except select * from
> "pg_catalog".txid_snapshot_xip('13614:13614:') ) union all select
> log_origin, log_txid, log_tableid, log_actionseq, log_tablenspname,
> log_tablerelname, log_cmdtype, log_cmdupdncols, log_cmdargs from
> "_rf".sl_log_2 where log_origin = 1 and log_tableid in (3) and log_txid
>  >= '13607' and log_txid < '13614' and
> "pg_catalog".txid_visible_in_snapshot(log_txid, '13614:13614:') union
> all select log_origin, log_txid, log_tableid, log_actionseq,
> log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,
> log_cmdargs from "_rf".sl_log_2 where log_origin = 1 and log_tableid in
> (3) and log_txid in (select * from
> "pg_catalog".txid_snapshot_xip('13607:13607:') except select * from
> "pg_catalog".txid_snapshot_xip('13614:13614:') ) order by log_actionseq)
> TO STDOUT'
> 2014-04-11 11:20:09 PDT ERROR  remoteWorkerThread_1: SYNC aborted
>
> Thank you in advance.
>
> --
> Regards
> Raghav
> Blog: htt://raghavt.blogspot.com/ <http://raghavt.blogspot.com/>
>
>
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at lists.slony.info
> http://lists.slony.info/mailman/listinfo/slony1-general
>

-- 
Jan Wieck
Senior Software Engineer
http://slony.info