Tue Sep 21 16:23:52 PDT 2010
I got some time and decided to test this again on some VM boxes rather than our live environment, but had little luck. So that this is logged on the mailing list (and hopefully gets a solution in the near future), here is the process I performed.

I created two clusters that mirror our live boxes as closely as possible:

- PostgreSQL version 8.4.2
- Slony version 1.2.20
- both installed from source

I created the master cluster as:

# initdb -D /usr/local/pgsql/encoding_master/ --locale=C --encoding=LATIN1

I created the slave cluster as:

# initdb -D /usr/local/pgsql/encoding_slave/ --locale=C --encoding=SQL_ASCII

I set up a master -> slave slony cluster, replicated a single table in a single replication set, and verified that replication was taking place. I wrote a small daemon that inserts a row into the replicated table on the master once a minute.

I brought down the slon daemons and performed a pg_dump on the slave:

# pg_dump -p 5433 -Fc postgres > /tmp/postgres_dump.sql

I brought down the slave cluster, then created a new one with the LATIN1 encoding:

# initdb -D /usr/local/pgsql/encoding_slave_latin/ --locale=C --encoding=LATIN1

I brought the new cluster online, restored the dump into it, and started up the slon daemons. The slave slon daemon reported remote worker and remote listener threads and reported increasing SYNC numbers; however, it did not actually replicate data from the master to the slave, and _slony.sl_log_1 on the master grew with every insert that took place.

NOTE: This is the same behavior I experienced before on our live servers.

I then executed the following:

#!/bin/bash
. etc/slony.env

echo "Repair config"
slonik <<_EOF_
cluster name = $CLUSTERNAME ;
node 1 admin conninfo = 'dbname=$MASTERDBNAME host=$MASTERHOST port=$MASTERPORT user=$REPUSER';
node 2 admin conninfo = 'dbname=$SLAVEDBNAME host=$SLAVEHOST port=$SLAVEPORT user=$REPUSER';
REPAIR CONFIG (SET ID = 1, EVENT NODE = 1, EXECUTE ONLY ON = 2);
_EOF_

It executed without error; however, replication did not start working, and the slave slon daemon started acting strangely, with its child process being terminated constantly, then restarted every 10 seconds only to be terminated again.
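For reference, the check I intend to use to tell whether REPAIR CONFIG actually re-pointed the slave's Slony catalog at the restored tables is to compare the OIDs stored in sl_table against the current system catalog. This is only a sketch, assuming the cluster schema is _slony (as on our live boxes; the test cluster may be named differently) and the 1.2-style sl_table columns:

    -- run on the slave (node 2)
    SELECT t.tab_id, t.tab_nspname, t.tab_relname,
           t.tab_reloid AS oid_stored_in_sl_table,
           c.oid        AS oid_in_pg_class
      FROM _slony.sl_table t
      JOIN pg_catalog.pg_namespace n ON n.nspname = t.tab_nspname
      JOIN pg_catalog.pg_class c     ON c.relnamespace = n.oid
                                    AND c.relname = t.tab_relname;

If the two OID columns still differ after REPAIR CONFIG, the mapping is still stale, and I'd want to rule that out before digging further.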
Monitoring the slave slon log, the following chunk gets repeated every 10 seconds (with the addition of another queued event each time):

2010-09-21 16:39:08 MDT CONFIG main: slon version 1.2.20 starting up
2010-09-21 16:39:08 MDT DEBUG2 slon: watchdog process started
2010-09-21 16:39:08 MDT DEBUG2 slon: watchdog ready - pid = 13845
2010-09-21 16:39:08 MDT DEBUG2 slon: worker process created - pid = 14308
2010-09-21 16:39:08 MDT CONFIG main: local node id = 2
2010-09-21 16:39:08 MDT DEBUG2 main: main process started
2010-09-21 16:39:08 MDT CONFIG main: launching sched_start_mainloop
2010-09-21 16:39:08 MDT CONFIG main: loading current cluster configuration
2010-09-21 16:39:08 MDT CONFIG storeNode: no_id=1 no_comment='Server 1'
2010-09-21 16:39:08 MDT DEBUG2 setNodeLastEvent: no_id=1 event_seq=1008
2010-09-21 16:39:08 MDT CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="dbname=postgres host=172.16.44.133 port=5432 user=postgres" pa_connretry=10
2010-09-21 16:39:08 MDT CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-09-21 16:39:08 MDT CONFIG storeSet: set_id=1 set_origin=1 set_comment='logtester'
2010-09-21 16:39:08 MDT WARN remoteWorker_wakeup: node 1 - no worker thread
2010-09-21 16:39:08 MDT DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2010-09-21 16:39:08 MDT CONFIG storeSubscribe: sub_set=1 sub_provider=1 sub_forward='f'
2010-09-21 16:39:08 MDT WARN remoteWorker_wakeup: node 1 - no worker thread
2010-09-21 16:39:08 MDT DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2010-09-21 16:39:08 MDT CONFIG enableSubscription: sub_set=1
2010-09-21 16:39:08 MDT WARN remoteWorker_wakeup: node 1 - no worker thread
2010-09-21 16:39:08 MDT DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2010-09-21 16:39:08 MDT DEBUG2 main: last local event sequence = 575
2010-09-21 16:39:08 MDT CONFIG main: configuration complete - starting threads
2010-09-21 16:39:08 MDT DEBUG1 localListenThread: thread starts
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=14291
2010-09-21 16:39:08 MDT CONFIG enableNode: no_id=1
2010-09-21 16:39:08 MDT DEBUG1 main: running scheduler mainloop
2010-09-21 16:39:08 MDT DEBUG1 remoteWorkerThread_1: thread starts
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG4 remoteWorkerThread_1: update provider configuration
2010-09-21 16:39:08 MDT DEBUG1 remoteWorkerThread_1: helper thread for provider 1 created
2010-09-21 16:39:08 MDT DEBUG4 remoteWorkerThread_1: added active set 1 to provider 1
2010-09-21 16:39:08 MDT DEBUG2 remoteWorkerThread_1: set 1 starts at ssy_seqno 1008
2010-09-21 16:39:08 MDT DEBUG1 remoteListenThread_1: thread starts
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: start listening for event origin 1
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres host=172.16.44.133 port=5432 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG1 cleanupThread: thread starts
2010-09-21 16:39:08 MDT DEBUG4 cleanupThread: bias = 35383
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG1 syncThread: thread starts
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG1 remoteListenThread_1: connected to 'dbname=postgres host=172.16.44.133 port=5432 user=postgres'
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1009 RESET_CONFIG
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1010 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1011 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1012 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1013 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1014 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1015 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1016 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1017 RESET_CONFIG
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1018 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1019 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1020 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1021 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1022 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1023 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1024 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1025 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1026 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1027 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1028 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1029 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1030 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1031 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1032 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1033 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1034 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1035 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1036 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1037 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1038 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1039 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1040 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1041 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1042 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1043 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1044 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1045 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1046 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1047 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1048 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1049 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1050 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1051 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1052 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1053 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1054 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1055 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1056 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1057 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1058 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1059 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1060 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1061 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1062 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1063 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1064 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1065 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1066 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1067 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1068 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1069 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1070 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteWorkerThread_1: Received event 1,1009 RESET_CONFIG
2010-09-21 16:39:09 MDT DEBUG2 slon: child terminated status: 11; pid: 14308, current worker pid: 14308
2010-09-21 16:39:09 MDT DEBUG1 slon: restart of worker in 10 seconds

At this point, slony is not replicating, the slave daemon's child process won't stay alive, and restarting the daemons yields nothing.
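One thing I still need to check (it lines up with Steve's sl_confirm question quoted below) is whether node 2's confirmations are making it back to the master at all. A rough sketch of the queries I have in mind, run on the master and assuming the cluster schema is named _slony:

    -- newest event the master has generated, per origin
    SELECT ev_origin, max(ev_seqno) FROM _slony.sl_event GROUP BY ev_origin;

    -- newest event of each origin that node 2 has confirmed back to the master
    SELECT con_origin, max(con_seqno)
      FROM _slony.sl_confirm
     WHERE con_received = 2
     GROUP BY con_origin;

If the confirmed seqno for origin 1 stops moving while sl_event keeps growing, that would match sl_log_1 piling up.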
I stopped the slon daemons, brought down the slave database, restored the OLD slave database that replication had been working on, and started up the daemons. Of the events that were queued up in sl_log_1, half of them replicated over to the slave before the slave daemon's child process was terminated. After this, the slave daemon entered the same loop it was in on the previous slave (with the log printed above). So at this point I'm on the old slave cluster and slony replication is dead. I'm not too concerned about getting it running again at the moment, as it's just a test system. What I do need working, however, is replication on the new cluster.

NEXT STEP: I'll be resetting everything and repeating the process. This time around, however, I'll be including OIDs in the pg_dump and monitoring how slony behaves with that approach.
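For that attempt, the dump command would roughly become the following (untested on my side yet; -o / --oids just adds row OIDs to the dumped data, everything else is the same as the earlier command):

    # pg_dump -p 5433 -o -Fc postgres > /tmp/postgres_dump_oids.sql

followed by the same initdb with LATIN1 and restore into the fresh cluster as before.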
Any suggestions / error corrections are appreciated.

- Brian F

Steve Singer wrote:
> On Thu, 16 Sep 2010, Brian Fehrle wrote:
>
> What are your slon processes logging/printing? Are the remoteWorker
> threads actually processing events, or is the remoteListener the only
> thread logging? Are the events in sl_event being marked as confirmed
> in sl_confirm on the slave? Is the data making it back to sl_confirm
> on the master? (I suspect not, otherwise sl_log_1 wouldn't keep
> growing.)
>
> Also remember to run REPAIR CONFIG
> (http://www.slony.info/documentation/stmtrepairconfig.html) after
> restoring from a pg_dump.
>
>> Hi all,
>> Due to realizing that our 1 master -> 1 slave slony cluster had
>> different encodings on each box, we attempted to fix that. Our master
>> had an encoding of LATIN1 and our slave had an encoding of SQL_ASCII
>> (they were initialized so long ago, we don't know who did it or why it
>> was done that way).
>>
>> Slony worked with this setup, but we wanted to fix it, due to some
>> other problems, by moving the slave from SQL_ASCII to LATIN1.
>>
>> So we brought down the slon daemons, brought down the slave database,
>> and rebooted the physical machine the slave is on (we had commented out
>> dozens of cron jobs and wanted to verify they were all dead).
>>
>> When we rebooted the machine, we brought the slave postgres cluster
>> online and performed a pg_dump on the entire database (including the
>> _slony schema). Then we brought down the postgres cluster, ran initdb
>> to create a new one with LATIN1 encoding, brought the new cluster
>> online, and ran a pg_restore on it with the dump file we created
>> before.
>>
>> After that we restarted our cron jobs, which also started up the two
>> slon daemons. We started monitoring the slave and noticed that no
>> updates were being applied. We're running the slon daemons with
>> -s 60000 (force a sync every 60 seconds) and a -x flag to get some
>> slony logs for log shipping. The slony logs generated with -x are
>> empty (they have the slony header and footer, but no insert data).
>>
>> On the master, if I do a # select * from _slony.sl_status; I get back
>> that there are anywhere between 0 - 2 events, and a lag time no
>> greater than 3 minutes. Monitoring the slave slony log output also
>> verifies that events are being received and processed without error
>> every minute.
>>
>> Again, on the master, # select count(*) from _slony.sl_log_1; returns
>> 12,000+ rows, and it continually grows. So from what I can tell, the
>> master is queueing up events, but the data in those events is not
>> being pushed to the slave; each event is completely void of data, and
>> it looks like sl_log_1 just keeps building up.
>>
>> One theory is that even though we have an exact data dump of the old
>> slave cluster restored to the new slave cluster, since the encoding
>> has changed, perhaps the master doesn't recognize the slave as the
>> same slave it had before. If that's the case, is there any way we can
>> get it to recognize it without having to rebuild the slony cluster?
>> (Rebuilding the cluster would mean a few days of work, if not weeks.)
>>
>> Other than that, I'm unsure what to make of this. I've restarted the
>> daemons, and neither the master nor the slave daemon reports any
>> errors in the logs. I verified that the triggers exist on the master
>> as they should (we never touched the master anyway, but I'm still
>> checking everything), and the path to the slave remained the same as
>> for the previous slave (same dbname, host, port, user).
>>
>> Any thoughts or things I can check would be appreciated. Or if my
>> theory about the master not recognizing the new slave cluster as the
>> old one is correct, and we can fix that, that would be great.
>>
>> thanks in advance,
>> Brian F
>> _______________________________________________
>> Slony1-general mailing list
>> Slony1-general at lists.slony.info
>> http://lists.slony.info/mailman/listinfo/slony1-general
>>
>