[Slony1-general] Uninterrupted Slony Replication

Fri Aug 5 16:22:46 PDT 2011

Hi,

I am using postgresql-8.4 and slony1-1.2.0.3, and I have been able to
implement a 4-node replication cluster in which the nodes communicate
successfully with each other. The way I went about this is that I wrote
scripts (say cluster_setup.sh and subscribe.sh) to be run with slonik. That
is, I run cluster_setup.sh on the master node, then start the slon daemons on
all 4 nodes with the necessary connection information, and finally run
subscribe.sh on the master node again. This works perfectly fine, and even
when I kill some of the slons on the different machines, once I start slon
again, replication on that node picks up where it left off.

After this I tried automating the whole process so that in case of a network
disconnect, power failure, or reboot the replication continues to work as
normal. So instead of running the slons manually on each machine, I placed a
script containing the command 'bash -U postgres -c "./slon conninfo=" ' in
the init.d directory on each machine. With database replication running again
on all nodes, I rebooted one of the machines, but replication was not
restored after that. The node acting as the provider to the rebooted machine
started showing this error:

2011-08-05 09:25:40 PDTERROR  remoteListenThread_3: "select con_origin,
con_received,     max(con_seqno) as con_seqno,     max(con_timestamp) as
con_timestamp from "_four_node_rep_cluster20".sl_confirm where con_received
<> 2 group by con_origin, con_received"
2011-08-05 09:25:42 PDTERROR  remoteListenThread_3: "select ev_origin,
ev_seqno, ev_timestamp,
ev_snapshot,        "pg_catalog".txid_snapshot_xmin(ev_snapshot),
"pg_catalog".txid_snapshot_xmax(ev_snapshot),        ev_type,
ev_data1, ev_data2,        ev_data3, ev_data4,        ev_data5,
ev_data6,        ev_data7, ev_data8 from "_four_node_rep_cluster20".sl_event
e where (e.ev_origin = '3' and e.ev_seqno > '5000000005') or (e.ev_origin =
'4' and e.ev_seqno > '5000000039') order by e.ev_origin, e.ev_seqno limit
40" - no connection to the server

After that, replication does not start working again until I reboot all the
nodes. My guess is that the provider node gets reinitialized on reboot, which
is why replication starts again. I know Slony is used for automated database
replication, so I was wondering whether there is any way to make this work
without rebooting all the nodes, which would be inconvenient as the number of
nodes increases or on a production server.
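
For concreteness, here is a minimal sketch of the kind of startup wrapper described above. It is an illustration, not the exact script in use. The cluster name is taken from the schema name in the log above. The conninfo string, the path to slon, the retry count and delay, and the helper names (retry, db_ready, start_slon) are all placeholders chosen for the example. The idea is only to wait until the local database accepts connections before slon is launched at boot.

```shell
#!/bin/sh
# Illustrative sketch only. CLUSTER is taken from the schema name
# "_four_node_rep_cluster20" seen in the log; CONNINFO, SLON_BIN and
# RETRY_DELAY are placeholder values, not the real settings.

CLUSTER="${CLUSTER:-four_node_rep_cluster20}"
CONNINFO="${CONNINFO:-dbname=mydb host=localhost user=postgres}"
SLON_BIN="${SLON_BIN:-./slon}"
RETRY_DELAY="${RETRY_DELAY:-5}"

# Run a command until it succeeds, at most $1 times,
# sleeping RETRY_DELAY seconds between attempts.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max" ]; then
            sleep "$RETRY_DELAY"
        fi
    done
    return 1
}

# Succeeds once the database named in CONNINFO accepts a trivial query.
db_ready() {
    psql -d "$CONNINFO" -c 'select 1' >/dev/null 2>&1
}

# Wait for the database, then run slon in the foreground
# as: slon <clustername> <conninfo>.
start_slon() {
    if ! retry 30 db_ready; then
        echo "database not reachable, not starting slon" >&2
        return 1
    fi
    "$SLON_BIN" "$CLUSTER" "$CONNINFO"
}
```

An init.d script would then call start_slon as the postgres user. Whether this ordering has anything to do with the error shown above is an open question; the sketch only makes the startup step explicit.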

Any input on the above error would be greatly appreciated.

Regards
Dilraj Singh