Fri Mar 14 06:24:31 PDT 2008
Jan Wieck wrote:
> On 3/13/2008 12:25 PM, Emanuel Petr wrote:
>> Cédric Villemain wrote:
>>> On Thursday 13 March 2008, Emanuel Petr wrote:
>>>> Hi all,
>>>> we have a problem replicating a 12 GB table.
>>>>
>>>> Note: other smaller tables were replicated without problems.
>>>>
>>>> Here is what I see on the "slave" node:
>>>> $ grep 'action"' /var/log/slony1/slon-db.log
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE: truncate of "public"."action" failed - doing delete
>>>>
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE: truncate of "public"."action" failed - doing delete
>>>> DEBUG2 remoteWorkerThread_1: 6584477358 bytes copied for table "public"."action"
>>>>
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE: truncate of "public"."action" failed - doing delete
>>>> DEBUG2 remoteWorkerThread_1: 6587572735 bytes copied for table "public"."action"
>>>>
>>>> DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: copy table "public"."action"
>>>> DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
>>>> NOTICE: truncate of "public"."action" failed - doing delete
>>>>
>>>> I can't find any error message in the log.
>>>
>>> I think you probably made other tries before.
>>
>> No, there was only one try. It was meant to show you that the COPY
>> action for this table is in a loop.
>
> The question is what aborts the subscription process. Since the copy
> operation in the second and third attempt above did succeed, it might
> not be related to that table itself, but something that happens later.
> Can you provide the entire log from one "prepare to copy ..." to the next?
>
>
> Jan

Yes, you are right. I found out what aborts the subscription process -
look at 'cleanupThread' and further:

DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"
DEBUG2 remoteWorkerThread_1: all tables for set 2 found on subscriber
DEBUG2 remoteWorkerThread_1: copy table "public"."action"
DEBUG2 remoteWorkerThread_1: Begin COPY of table "public"."action"
DEBUG2 remoteWorkerThread_1: nodeon73 is 0
NOTICE: truncate of "public"."action" failed - doing delete
DEBUG2 syncThread: new sl_action_seq 1 - SYNC 3
DEBUG2 localListenThread: Received event 2,3 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1380846 SYNC
DEBUG2 syncThread: new sl_action_seq 1 - SYNC 119
....
DEBUG2 remoteListenThread_1: queue event 1,1381413 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381414 SYNC
DEBUG2 remoteWorkerThread_1: 6614093174 bytes copied for table "public"."action"
DEBUG2 remoteListenThread_1: queue event 1,1381415 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1382888 SYNC
...
DEBUG2 localListenThread: Received event 2,419 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1382889 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1382890 SYNC
FATAL  cleanupThread: "delete from "_test_cluster".sl_log_1 where log_origin = '2' and log_xid < '562381542';
delete from "_test_cluster".sl_log_2 where log_origin = '2' and log_xid < '562381542';
delete from "_test_cluster".sl_seqlog where seql_origin = '2' and seql_ev_seqno < '3';
select "_test_cluster".logswitch_finish(); " -
ERROR: cancelling statement due to statement timeout
CONTEXT: SQL statement "lock table "_test_cluster".sl_config_lock"
PL/pgSQL function "logswitch_finish" line 9 at SQL statement
DEBUG2 slon_retry() from pid=20337
DEBUG1 slon: retry requested
DEBUG2 slon: notify worker process to shutdown
INFO remoteListenThread_1: disconnecting from 'dbname=test host=test user=slony password=pass'
DEBUG1 syncThread: thread done
DEBUG1 remoteListenThread_1: thread done
DEBUG1 localListenThread: thread done
DEBUG1 main: scheduler mainloop returned
DEBUG2 main: wait for remote threads
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
DEBUG1 slon: child termination timeout - kill child
DEBUG2 slon: child terminated status: 9; pid: 20337, current worker pid: 20337
DEBUG1 slon: restart of worker in 10 seconds
CONFIG main: slon version 1.2.9 starting up
DEBUG2 slon: watchdog process started
DEBUG2 slon: watchdog ready - pid = 20336
DEBUG2 slon: worker process created - pid = 11621
CONFIG main: local node id = 2
DEBUG2 main: main process started
CONFIG main: launching sched_start_mainloop
CONFIG main: loading current cluster configuration
CONFIG storeNode: no_id=1 no_comment='Primary Node 1'
DEBUG2 setNodeLastEvent: no_id=1 event_seq=1380820
CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="dbname=test host=test user=slony password=pass" pa_connretry=10
CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
CONFIG storeSet: set_id=1 set_origin=1 set_comment='test tables'
WARN remoteWorker_wakeup: node 1 - no worker thread
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
CONFIG storeSet: set_id=2 set_origin=1 set_comment='test action table'
WARN remoteWorker_wakeup: node 1 - no worker thread
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
CONFIG storeSubscribe: sub_set=2 sub_provider=1 sub_forward='f'
WARN remoteWorker_wakeup: node 1 - no worker thread
DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
DEBUG2 main: last local event sequence = 419
CONFIG main: configuration complete - starting threads
DEBUG1 localListenThread: thread starts
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=20342
CONFIG enableNode: no_id=1
DEBUG1 main: running scheduler mainloop
DEBUG1 cleanupThread: thread starts
DEBUG1 remoteListenThread_1: thread starts
DEBUG2 remoteListenThread_1: start listening for event origin 1
DEBUG1 remoteWorkerThread_1: thread starts
DEBUG1 syncThread: thread starts
DEBUG1 remoteListenThread_1: connected to 'dbname=test host=test user=slony password=pass'
DEBUG2 remoteListenThread_1: queue event 1,1380821 ENABLE_SUBSCRIPTION
DEBUG2 remoteListenThread_1: queue event 1,1380822 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1380823 SYNC
...
DEBUG2 remoteListenThread_1: queue event 1,1381003 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381004 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381005 SYNC
DEBUG2 remoteWorkerThread_1: Received event 1,1380821 ENABLE_SUBSCRIPTION
DEBUG2 remoteListenThread_1: queue event 1,1381006 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1381007 SYNC
...
DEBUG2 remoteListenThread_1: queue event 1,1383140 SYNC
DEBUG2 remoteListenThread_1: queue event 1,1383141 SYNC
DEBUG2 syncThread: new sl_action_seq 1 - SYNC 468
DEBUG2 localListenThread: Received event 2,468 SYNC
DEBUG1 copy_set 2
DEBUG1 remoteWorkerThread_1: connected to provider DB
DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."action"

The connection was restarted due to the cleanupThread's statement
timeout, and the subscription process started again from scratch.

WHAT HAPPENED BEHIND THE SCENES:

After the table was copied, Slony started this process (procpid 21563):

  select "_test_cluster".finishTableAfterCopy(1);
  analyze "public"."action";

which ran for a long time (in my case about 70 minutes), while process
(procpid 21562)

  delete from "_test_cluster".sl_log_1 where log_origin = '2' and log_xid < '562440300';
  delete from "....

was waiting for procpid 21563. Postgres was configured with
"statement_timeout = 3600000" (60 min):
http://www.postgresql.org/docs/8.1/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT

Now you can see what the problem was :) Postgres aborted this process
and the replication started over.
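In case anyone wants to catch this in the act, the lock wait is visible
in the PostgreSQL system views. A minimal diagnostic sketch against the
8.1 catalogs (on 8.1 the columns are called procpid and current_query;
later releases renamed them to pid and query):

  -- long-running statements, e.g. the finishTableAfterCopy()/analyze
  select procpid, now() - query_start as runtime, current_query
    from pg_stat_activity
   where current_query <> '<IDLE>'
   order by query_start;

  -- backends stuck waiting on a lock; the cleanup thread's
  -- "lock table sl_config_lock" shows up here with granted = f
  select l.pid, l.mode, l.granted, c.relname
    from pg_locks l
    left join pg_class c on c.oid = l.relation
   where not l.granted;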
SOLUTION:

Check your 'statement_timeout' and ensure that "analyze" has enough
time to finish, e.g.

  # postgresql.conf
  statement_timeout = 10800000   # 3 hours, in milliseconds

or disable it:

  statement_timeout = 0          # 0 disables the timeout

Thanks for the help ;)
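If raising the timeout cluster-wide is undesirable, a narrower variant
(assuming the slon daemons connect as a dedicated role, here the
'slony' user from the conninfo above) is to override the setting for
that role only:

  -- check the current value first
  show statement_timeout;

  -- disable the timeout for the replication role only; per-role
  -- settings apply to new sessions, so restart the slons afterwards
  alter user slony set statement_timeout = 0;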
>>
>>> If for any reason slony cannot truncate the table before
>>> replicating it, it deletes every row instead.
>>
>> The table on the "slave" node was blank before the replication
>> started. And the message "truncate of ... failed - doing delete"
>> appears for other tables too, which were replicated without problems.
>> So I don't think that is the problem.
>>
>> There is a problem only with bigger (over 10GB) tables.
>>
>>>
>>> 68GB/12GB suggests about 5 attempts to replicate the table?
>>>
>>> I note you are not on the latest versions of everything you are
>>> using.... (os, kernel, postgres and slony)
>>>
>>> You must manually truncate the tables on the 'slave', then restart
>>> your replica from scratch.
>>
>> And now I'm not able to work with this "badly" replicated table. Each
>> command on this table hangs. Note: Slony is stopped.
>>
>>
>>>
>>>> The COPY event for this "big" table is in a loop and the table size
>>>> on the "slave" node is still growing.
>>>>
>>>> On the "master" node, the table size is 12GB.
>>>> On the "slave" node the table size was 68GB before I stopped the
>>>> replication.
>>>>
>>>>
>>>> -------------
>>>>
>>>> OS: Ubuntu 6.06.2 LTS , 2.6.15-29-amd64-server, x86_64
>>>>
>>>> DB: postgresql-8.1
>>>>
>>>> SLONY: Version: 1.2.9-2ubuntu0~dapper0
>>>>
>>>>
>>>> Detail of our problematic "12GB" table
>>>> =# \d+ action;
>>>>                              Table "public.action"
>>>>    Column   |            Type             |                      Modifiers                       | Description
>>>> ------------+-----------------------------+------------------------------------------------------+-------------
>>>>  id         | integer                     | not null default nextval('action_id_seq'::regclass)  |
>>>>  clientid   | integer                     |                                                       |
>>>>  action     | integer                     | not null                                              |
>>>>  response   | integer                     |                                                       |
>>>>  startdate  | timestamp without time zone | not null default now()                               |
>>>>  clienttrid | character varying(128)      | not null                                              |
>>>>  enddate    | timestamp without time zone |                                                       |
>>>>  servertrid | character varying(128)      |                                                       |
>>>> Indexes:
>>>>     "action_pkey" PRIMARY KEY, btree (id)
>>>>     "action_servertrid_key" UNIQUE, btree (servertrid)
>>>>     "action_action_idx" btree ("action")
>>>>     "action_clientid_idx" btree (clientid)
>>>>     "action_response_idx" btree (response)
>>>>     "action_startdate_idx" btree (startdate)
>>>> Foreign-key constraints:
>>>>     "action_action_fkey" FOREIGN KEY ("action") REFERENCES enum_action(id)
>>>>     "action_clientid_fkey" FOREIGN KEY (clientid) REFERENCES "login"(id)
>>>>     "action_response_fkey" FOREIGN KEY (response) REFERENCES enum_error(id)
>>>>
>>>>
>>>> ---------------
>>>>
>>>> Does anyone have an idea what could be wrong?
>>>>
>>>> Thanks,
>>>> Petr
>>>> _______________________________________________
>>>> Slony1-general mailing list
>>>> Slony1-general at lists.slony.info
>>>> http://lists.slony.info/mailman/listinfo/slony1-general
>>
>> _______________________________________________
>> Slony1-general mailing list
>> Slony1-general at lists.slony.info
>> http://lists.slony.info/mailman/listinfo/slony1-general