Tue Sep 21 16:23:52 PDT 2010
I got some time and decided to test this again on some VM boxes rather than our live environment, but had little luck. So that this is logged on the mailing list (and hopefully gets a solution in the near future), here is the process I performed.

I created two clusters that mirror our live boxes as closely as possible:

- PostgreSQL version 8.4.2
- Slony version 1.2.20
- both installed from source

I created the master cluster as:

# initdb -D /usr/local/pgsql/encoding_master/ --locale=C --encoding=LATIN1

I created the slave cluster as:

# initdb -D /usr/local/pgsql/encoding_slave/ --locale=C --encoding=SQL_ASCII

I set up a master -> slave slony cluster, replicated a single table in a single replication set, and verified that replication was taking place. I wrote a small daemon that inserts a row into the replicated table on the master once a minute.

I brought down the slon daemons and performed a pg_dump on the slave:

# pg_dump -p 5433 -Fc postgres > /tmp/postgres_dump.sql

I brought down the slave cluster, then created a new one with the LATIN1 encoding:

# initdb -D /usr/local/pgsql/encoding_slave_latin/ --locale=C --encoding=LATIN1

I brought the new cluster online, restored the dump into it, and started up the slon daemons. The slave slon daemon reported remote worker and remote listener threads and reported increasing SYNC numbers; however, it did not actually replicate data from the master to the slave, and _slony.sl_log_1 on the master grew with every insert that took place.

NOTE: This is the same behavior I experienced before on our live servers.

I then executed the following:

#!/bin/bash
. etc/slony.env

echo "Repair config"
slonik <<_EOF_
cluster name = $CLUSTERNAME ;
node 1 admin conninfo = 'dbname=$MASTERDBNAME host=$MASTERHOST port=$MASTERPORT user=$REPUSER';
node 2 admin conninfo = 'dbname=$SLAVEDBNAME host=$SLAVEHOST port=$SLAVEPORT user=$REPUSER';
REPAIR CONFIG (SET ID = 1, EVENT NODE = 1, EXECUTE ONLY ON = 2);
_EOF_

It executed without error; however, replication did not start working, and the slave slon daemon started acting strangely, with its child process being terminated constantly, then restarted every 10 seconds only to be terminated again.
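For reference, the check I intend to use to tell whether REPAIR CONFIG actually re-pointed the slave's Slony catalog at the restored tables is to compare the OIDs stored in sl_table against the current system catalog. This is only a sketch, assuming the cluster schema is _slony (as on our live boxes; the test cluster may be named differently) and the 1.2-style sl_table columns:

    -- run on the slave (node 2)
    SELECT t.tab_id, t.tab_nspname, t.tab_relname,
           t.tab_reloid AS oid_stored_in_sl_table,
           c.oid        AS oid_in_pg_class
      FROM _slony.sl_table t
      JOIN pg_catalog.pg_namespace n ON n.nspname = t.tab_nspname
      JOIN pg_catalog.pg_class c     ON c.relnamespace = n.oid
                                    AND c.relname = t.tab_relname;

If the two OID columns still differ after REPAIR CONFIG, the mapping is still stale, and I'd want to rule that out before digging further.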
Monitoring the slave slon log, the following chunk gets repeated every 10 seconds (with the addition of another queued event each time):

2010-09-21 16:39:08 MDT CONFIG main: slon version 1.2.20 starting up
2010-09-21 16:39:08 MDT DEBUG2 slon: watchdog process started
2010-09-21 16:39:08 MDT DEBUG2 slon: watchdog ready - pid = 13845
2010-09-21 16:39:08 MDT DEBUG2 slon: worker process created - pid = 14308
2010-09-21 16:39:08 MDT CONFIG main: local node id = 2
2010-09-21 16:39:08 MDT DEBUG2 main: main process started
2010-09-21 16:39:08 MDT CONFIG main: launching sched_start_mainloop
2010-09-21 16:39:08 MDT CONFIG main: loading current cluster configuration
2010-09-21 16:39:08 MDT CONFIG storeNode: no_id=1 no_comment='Server 1'
2010-09-21 16:39:08 MDT DEBUG2 setNodeLastEvent: no_id=1 event_seq=1008
2010-09-21 16:39:08 MDT CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="dbname=postgres host=172.16.44.133 port=5432 user=postgres" pa_connretry=10
2010-09-21 16:39:08 MDT CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-09-21 16:39:08 MDT CONFIG storeSet: set_id=1 set_origin=1 set_comment='logtester'
2010-09-21 16:39:08 MDT WARN remoteWorker_wakeup: node 1 - no worker thread
2010-09-21 16:39:08 MDT DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2010-09-21 16:39:08 MDT CONFIG storeSubscribe: sub_set=1 sub_provider=1 sub_forward='f'
2010-09-21 16:39:08 MDT WARN remoteWorker_wakeup: node 1 - no worker thread
2010-09-21 16:39:08 MDT DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2010-09-21 16:39:08 MDT CONFIG enableSubscription: sub_set=1
2010-09-21 16:39:08 MDT WARN remoteWorker_wakeup: node 1 - no worker thread
2010-09-21 16:39:08 MDT DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
2010-09-21 16:39:08 MDT DEBUG2 main: last local event sequence = 575
2010-09-21 16:39:08 MDT CONFIG main: configuration complete - starting threads
2010-09-21 16:39:08 MDT DEBUG1 localListenThread: thread starts
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=14291
2010-09-21 16:39:08 MDT CONFIG enableNode: no_id=1
2010-09-21 16:39:08 MDT DEBUG1 main: running scheduler mainloop
2010-09-21 16:39:08 MDT DEBUG1 remoteWorkerThread_1: thread starts
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG4 remoteWorkerThread_1: update provider configuration
2010-09-21 16:39:08 MDT DEBUG1 remoteWorkerThread_1: helper thread for provider 1 created
2010-09-21 16:39:08 MDT DEBUG4 remoteWorkerThread_1: added active set 1 to provider 1
2010-09-21 16:39:08 MDT DEBUG2 remoteWorkerThread_1: set 1 starts at ssy_seqno 1008
2010-09-21 16:39:08 MDT DEBUG1 remoteListenThread_1: thread starts
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: start listening for event origin 1
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres host=172.16.44.133 port=5432 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG1 cleanupThread: thread starts
2010-09-21 16:39:08 MDT DEBUG4 cleanupThread: bias = 35383
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG1 syncThread: thread starts
2010-09-21 16:39:08 MDT DEBUG4 version for "dbname=postgres port=5433 host=172.16.44.133 user=postgres" is 80402
2010-09-21 16:39:08 MDT DEBUG1 remoteListenThread_1: connected to 'dbname=postgres host=172.16.44.133 port=5432 user=postgres'
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1009 RESET_CONFIG
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1010 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1011 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1012 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1013 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1014 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1015 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1016 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1017 RESET_CONFIG
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1018 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1019 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1020 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1021 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1022 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1023 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1024 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1025 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1026 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1027 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1028 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1029 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1030 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1031 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1032 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1033 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1034 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1035 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1036 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1037 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1038 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1039 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1040 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1041 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1042 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1043 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1044 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1045 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1046 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1047 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1048 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1049 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1050 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1051 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1052 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1053 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1054 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1055 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1056 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1057 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1058 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1059 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1060 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1061 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1062 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1063 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1064 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1065 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1066 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1067 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1068 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1069 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteListenThread_1: queue event 1,1070 SYNC
2010-09-21 16:39:08 MDT DEBUG2 remoteWorkerThread_1: Received event 1,1009 RESET_CONFIG
2010-09-21 16:39:09 MDT DEBUG2 slon: child terminated status: 11; pid: 14308, current worker pid: 14308
2010-09-21 16:39:09 MDT DEBUG1 slon: restart of worker in 10 seconds

At this point, slony is not replicating, the slave daemon's child process won't stay alive, and restarting the daemons yields nothing.
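One thing I still need to check (it lines up with Steve's sl_confirm question quoted below) is whether node 2's confirmations are making it back to the master at all. A rough sketch of the queries I have in mind, run on the master and assuming the cluster schema is named _slony:

    -- newest event the master has generated, per origin
    SELECT ev_origin, max(ev_seqno) FROM _slony.sl_event GROUP BY ev_origin;

    -- newest event of each origin that node 2 has confirmed back to the master
    SELECT con_origin, max(con_seqno)
      FROM _slony.sl_confirm
     WHERE con_received = 2
     GROUP BY con_origin;

If the confirmed seqno for origin 1 stops moving while sl_event keeps growing, that would match sl_log_1 piling up.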
I stopped the slon daemons, brought down the slave database, restored the OLD slave database that replication had been working on, and started up the daemons. Of the events that were queued up in sl_log_1, half of them replicated over to the slave before the slave daemon's child process was terminated. After this, the slave daemon entered the same loop it was in on the previous slave (with the log printed above). So at this point I'm on the old slave cluster and slony replication is dead. I'm not too concerned about getting it running again at the moment, as it's just a test system. What I do need working, however, is replication on the new cluster.

NEXT STEP: I'll be resetting everything and repeating the process. This time around, however, I'll be including OIDs in the pg_dump and monitoring how slony behaves with that approach.
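For that attempt, the dump command would roughly become the following (untested on my side yet; -o / --oids just adds row OIDs to the dumped data, everything else is the same as the earlier command):

    # pg_dump -p 5433 -o -Fc postgres > /tmp/postgres_dump_oids.sql

followed by the same initdb with LATIN1 and restore into the fresh cluster as before.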
Any suggestions / error corrections are appreciated.

- Brian F

Steve Singer wrote:
> On Thu, 16 Sep 2010, Brian Fehrle wrote:
>
> What are your slon processes logging/printing? Are the remoteWorker
> threads actually processing events, or is the remoteListener the only
> thread logging? Are the events in sl_event being marked as confirmed
> in sl_confirm on the slave? Is the data making it back to sl_confirm
> on the master? (I suspect not, otherwise sl_log_1 wouldn't keep
> growing.)
>
> Also remember to run REPAIR CONFIG
> (http://www.slony.info/documentation/stmtrepairconfig.html) after
> restoring from a pg_dump.
>
>> Hi all,
>> Due to realizing that our 1 master -> 1 slave slony cluster had
>> different encodings on each box, we attempted to fix that. Our master
>> had an encoding of LATIN1 and our slave had an encoding of SQL_ASCII
>> (they were initialized so long ago, we don't know who did it or why it
>> was done that way).
>>
>> Slony worked with this setup, but we wanted to fix it, due to some
>> other problems, by moving the slave from SQL_ASCII to LATIN1.
>>
>> So we brought down the slon daemons, brought down the slave database,
>> and rebooted the physical machine the slave is on (we had commented out
>> dozens of cron jobs and wanted to verify they were all dead).
>>
>> When we rebooted the machine, we brought the slave postgres cluster
>> online and performed a pg_dump on the entire database (including the
>> _slony schema). Then we brought down the postgres cluster, ran initdb
>> to create a new one with LATIN1 encoding, brought the new cluster
>> online, and ran a pg_restore on it with the dump file we created
>> before.
>>
>> After that we restarted our cron jobs, which also started up the two
>> slon daemons. We started monitoring the slave and noticed that no
>> updates were being applied. We're running the slon daemons with
>> -s 60000 (force a sync every 60 seconds) and a -x flag to get some
>> slony logs for log shipping. The slony logs generated with -x are
>> empty (they have the slony header and footer, but no insert data).
>>
>> On the master, if I do a # select * from _slony.sl_status; I get back
>> that there are anywhere between 0 - 2 events, and a lag time no
>> greater than 3 minutes. Monitoring the slave slony log output also
>> verifies that events are being received and processed without error
>> every minute.
>>
>> Again, on the master, # select count(*) from _slony.sl_log_1; returns
>> 12,000+ rows, and it continually grows. So from what I can tell, the
>> master is queueing up events, but the data in those events is not
>> being pushed to the slave; each event is completely void of data, and
>> it looks like sl_log_1 just keeps building up.
>>
>> One theory is that even though we have an exact data dump of the old
>> slave cluster restored to the new slave cluster, since the encoding
>> has changed, perhaps the master doesn't recognize the slave as the
>> same slave it had before. If that's the case, is there any way we can
>> get it to recognize it without having to rebuild the slony cluster?
>> (Rebuilding the cluster would mean a few days of work, if not weeks.)
>>
>> Other than that, I'm unsure what to make of this. I've restarted the
>> daemons, and neither the master nor the slave daemon reports any
>> errors in the logs. I verified that the triggers exist on the master
>> as they should (we never touched the master anyway, but I'm still
>> checking everything), and the path to the slave remained the same as
>> for the previous slave (same dbname, host, port, user).
>>
>> Any thoughts or things I can check would be appreciated. Or if my
>> theory about the master not recognizing the new slave cluster as the
>> old one is correct, and we can fix that, that would be great.
>>
>> thanks in advance,
>> Brian F
>> _______________________________________________
>> Slony1-general mailing list
>> Slony1-general at lists.slony.info
>> http://lists.slony.info/mailman/listinfo/slony1-general
>>
>