Emanuel Petr emanuel.petr at nic.cz
Fri Mar 14 06:24:31 PDT 2008
Jan Wieck wrote:
> On 3/13/2008 12:25 PM, Emanuel Petr wrote:
>> Cédric Villemain wrote:
>>>> On Thursday 13 March 2008, Emanuel Petr wrote:
>>>> Hi all,
>>>> we have a problem replicating a 12 GB table.
>>>>
>>>> Note: Other smaller tables were replicated without problem.
>>>>
>>>> Here is what I see on the "slave" node.
>>>> $ grep 'action"' /var/log/slony1/slon-db.log
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE:  truncate of "public"."action" failed - doing delete
>>>>
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE:  truncate of "public"."action" failed - doing delete
>>>> DEBUG2 remoteWorkerThread_1: 6584477358 bytes copied for table
>>>> "public"."action"
>>>>
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE:  truncate of "public"."action" failed - doing delete
>>>> DEBUG2 remoteWorkerThread_1: 6587572735 bytes copied for table
>>>> "public"."action"
>>>>
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE:  truncate of "public"."action" failed - doing delete
>>>>
>>>> I can't find any error message in the log.
>>>
>>> I think you probably made another try before.
>>
>> No, there was only one try. My aim was to show you that the COPY 
>> action for this table is in a loop.
> 
> The question is what aborts the subscription process. Since the copy 
> operation in the second and third attempt above did succeed, it might 
> not be related to that table itself, but something that happens later. 
> Can you provide the entire log from one "prepare to copy ..." to the next?
> 
> 
> Jan
> 

Yes, you are right. I found out what aborts the subscription process - 
see 'cleanupThread' and onwards in the log below.

DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
DEBUG2 remoteWorkerThread_1: all tables for set 2 found on subscriber
DEBUG2 remoteWorkerThread_1: copy table "public"."action"
DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
DEBUG2 remoteWorkerThread_1:  nodeon73 is 0
NOTICE:  truncate of "public"."action" failed - doing delete
DEBUG2 syncThread: new sl_action_seq 1 - SYNC 3
DEBUG2 localListenThread: Received event 2,3 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1380846 SYNC
DEBUG2 syncThread: new sl_action_seq 1 - SYNC 119
....
DEBUG2 remoteListenThread_1: queue event 1,1381413 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381414 SYNC
DEBUG2 remoteWorkerThread_1: 6614093174 bytes copied for table 
"public"."action"
DEBUG2 remoteListenThread_1: queue event 1,1381415 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1382888 SYNC
...
DEBUG2 localListenThread: Received event 2,419 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1382889 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1382890 SYNC
FATAL  cleanupThread: "delete from "_test_cluster".sl_log_1 where 
log_origin = '2' and log_xid < '562381542'; delete from 
"_test_cluster".sl_log_2 where log_origin = '2' and log_xid < 
'562381542'; delete from "_test_cluster".sl_seqlog where seql_origin = 
'2' and seql_ev_seqno < '3'; select "_test_cluster".logswitch_finish(); 
" - ERROR:  cancelling statement due to statement timeout
CONTEXT:  SQL statement "lock table "_test_cluster".sl_config_lock"
PL/pgSQL function "logswitch_finish" line 9 at SQL statement
DEBUG2 slon_retry() from pid=20337
DEBUG1 slon: retry requested
DEBUG2 slon: notify worker process to shutdown
INFO   remoteListenThread_1: disconnecting from 'dbname=test host=test 
user=slony password=pass'
DEBUG1 syncThread: thread done
DEBUG1 remoteListenThread_1: thread done
DEBUG1 localListenThread: thread done
DEBUG1 main: scheduler mainloop returned
DEBUG2 main: wait for remote threads
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
DEBUG1 slon: child termination timeout - kill child
DEBUG2 slon: child terminated status: 9; pid: 20337, current worker pid: 
20337
DEBUG1 slon: restart of worker in 10 seconds
CONFIG main: slon version 1.2.9 starting up
DEBUG2 slon: watchdog process started
DEBUG2 slon: watchdog ready - pid = 20336
DEBUG2 slon: worker process created - pid = 11621
CONFIG main: local node id = 2
DEBUG2 main: main process started
CONFIG main: launching sched_start_mainloop
CONFIG main: loading current cluster configuration
CONFIG storeNode: no_id=1 no_comment='Primary Node 1'
DEBUG2 setNodeLastEvent: no_id=1 event_seq=1380820
CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="dbname=test 
host=test user=slony password=pass" pa_connretry=10
CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
CONFIG storeSet: set_id=1 set_origin=1 set_comment='test tables'
WARN   remoteWorker_wakeup: node 1 - no worker thread
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
CONFIG storeSet: set_id=2 set_origin=1 set_comment='test action table'
WARN   remoteWorker_wakeup: node 1 - no worker thread
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
CONFIG storeSubscribe: sub_set=2 sub_provider=1 sub_forward='f'
WARN   remoteWorker_wakeup: node 1 - no worker thread
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
DEBUG2 main: last local event sequence = 419
CONFIG main: configuration complete - starting threads
DEBUG1 localListenThread: thread starts
NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=20342
CONFIG enableNode: no_id=1
DEBUG1 main: running scheduler mainloop
DEBUG1 cleanupThread: thread starts
DEBUG1 remoteListenThread_1: thread starts
DEBUG2 remoteListenThread_1: start listening for event origin 1
DEBUG1 remoteWorkerThread_1: thread starts
DEBUG1 syncThread: thread starts
DEBUG1 remoteListenThread_1: connected to 'dbname=test host=test 
user=slony password=pass'
DEBUG2 remoteListenThread_1: queue event 1,1380821 ENABLE_SUBSCRIPTION
DEBUG2 remoteListenThread_1: queue event 1,1380822 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1380823 SYNC
...
DEBUG2 remoteListenThread_1: queue event 1,1381003 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381004 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381005 SYNC
DEBUG2 remoteWorkerThread_1: Received event 1,1380821 ENABLE_SUBSCRIPTION
DEBUG2 remoteListenThread_1: queue event 1,1381006 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381007 SYNC
...
DEBUG2 remoteListenThread_1: queue event 1,1383140 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1383141 SYNC
DEBUG2 syncThread: new sl_action_seq 1 - SYNC 468
DEBUG2 localListenThread: Received event 2,468 SYNC
DEBUG1 copy_set 2
DEBUG1 remoteWorkerThread_1: connected to provider DB
DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"


The connection was restarted due to the "cleanupThread: statement 
timeout" error, and the subscription process started again from scratch.


WHAT HAPPENED BEHIND THE SCENES:
After the table was copied, Slony started this process (procpid 21563):
select "_test_cluster".finishTableAfterCopy(1); analyze
"public"."action";

which ran for a long time (in my case about 70 minutes),

and this process (procpid 21562):
delete from "_test_cluster".sl_log_1 where log_origin = '2' and log_xid 
< '562440300'; delete from "....
was waiting for procpid 21563.
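For anyone hitting the same symptom, a quick way to confirm that the 
cleanup DELETE is stuck behind the long-running ANALYZE is to look at 
pg_stat_activity on the subscriber while the COPY is finishing. A 
sketch (column names are those of the 8.x system catalogs):

```sql
-- List backends that are blocked waiting on a lock, together with
-- the query they are trying to run (PostgreSQL 8.x column names).
SELECT procpid, waiting, current_query
FROM pg_stat_activity
WHERE waiting = true;
```

In my case this would have shown procpid 21562 (the sl_log cleanup 
DELETE) waiting while procpid 21563 (finishTableAfterCopy + analyze) 
held the lock.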


Postgres was configured with "statement_timeout = 3600000" (60 minutes):

http://www.postgresql.org/docs/8.1/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT

Now you can see what the problem was :) Postgres aborted this process 
and the replication started over.


SOLUTION:
Check your 'statement_timeout' and ensure that "analyze" has enough time 
to finish, e.g.:

# postgresql.conf

statement_timeout = 10800000  # in milliseconds (3 hours)

or disable it entirely:
statement_timeout = 0  # 0 disables the timeout
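If you would rather keep the global timeout for regular clients, an 
alternative sketch is to override it only for the role that slon 
connects as (here assumed to be named "slony"):

```sql
-- Per-role override: the global statement_timeout stays in force,
-- but the replication user's sessions get no timeout.
ALTER USER slony SET statement_timeout = 0;
```

New slon connections made by that user then pick up the per-role 
setting automatically.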


Thanks for help ;)


>>
>>>
>>> If for any reason Slony cannot truncate the table before 
>>> replicating it, it deletes every row.
>>
>> The table on the "slave" node was blank before the replication started.
>> And the message "truncate of ... failed - doing delete" appears for 
>> other tables too, which were replicated without problems, so I don't 
>> think that's the problem.
>>
>> There is a problem only with bigger (over 10 GB) tables.
>>
>>>
>>> 68GB/12GB gives about 5 attempts to replicate the table?
>>>
>>> I omit the last versions of everything you are using.... (OS, 
>>> kernel, Postgres and Slony)
>>>
>>> You must manually truncate the tables on the 'slave', then restart 
>>> your replica from scratch.
>>
>> And now I'm not able to work with this "badly" replicated table. Every 
>> command on this table hangs. Note: Slony is stopped.
>>
>>
>>>
>>>> The COPY event for this "big" table is in a loop and the table size
>>>> on the "slave" node is still growing.
>>>>
>>>> On the "Master" node, the table size is 12 GB.
>>>> On the "Slave" node the table size was 68 GB before I stopped the
>>>> replication.
>>>>
>>>>
>>>> -------------
>>>>
>>>> OS: Ubuntu 6.06.2 LTS , 2.6.15-29-amd64-server, x86_64
>>>>
>>>> DB: postgresql-8.1
>>>>
>>>> SLONY: Version: 1.2.9-2ubuntu0~dapper0
>>>>
>>>>
>>>> Detail of our problematic "12GB" table
>>>> =# \d+ action;
>>>>                          Table "public.action"
>>>>     Column   |            Type             |                      Modifiers                      | Description
>>>> ------------+-----------------------------+-----------------------------------------------------+-------------
>>>>   id         | integer                     | not null default nextval('action_id_seq'::regclass) |
>>>>   clientid   | integer                     |                                                     |
>>>>   action     | integer                     | not null                                            |
>>>>   response   | integer                     |                                                     |
>>>>   startdate  | timestamp without time zone | not null default now()                              |
>>>>   clienttrid | character varying(128)      | not null                                            |
>>>>   enddate    | timestamp without time zone |                                                     |
>>>>   servertrid | character varying(128)      |                                                     |
>>>> Indexes:
>>>>      "action_pkey" PRIMARY KEY, btree (id)
>>>>      "action_servertrid_key" UNIQUE, btree (servertrid)
>>>>      "action_action_idx" btree ("action")
>>>>      "action_clientid_idx" btree (clientid)
>>>>      "action_response_idx" btree (response)
>>>>      "action_startdate_idx" btree (startdate)
>>>> Foreign-key constraints:
>>>>      "action_action_fkey" FOREIGN KEY ("action") REFERENCES 
>>>> enum_action(id)
>>>>      "action_clientid_fkey" FOREIGN KEY (clientid) REFERENCES 
>>>> "login"(id)
>>>>      "action_response_fkey" FOREIGN KEY (response) REFERENCES
>>>> enum_error(id)
>>>>
>>>>
>>>> ---------------
>>>>
>>>> Does anyone have an idea what could be wrong?
>>>>
>>>> Thanks,
>>>> Petr
>>>> _______________________________________________
>>>> Slony1-general mailing list
>>>> Slony1-general at lists.slony.info
>>>> http://lists.slony.info/mailman/listinfo/slony1-general
>>>
>>>
>>>
> 
> 


