Tory M Blue tmblue at gmail.com
Sun Sep 11 23:41:22 PDT 2016
Jan has helped me before, giving me ideas to help with wide-area
replication, where the connection seems to drop between a large copy
set and/or an index creation. When no bits are crossing the wire, the
connection gets dropped by the firewall or something else, so when Slony
finishes a table or index creation and attempts to grab the next table,
the connection is no longer there; Slony reports a failure and tries again.

I think I'm running into this between my Colo and Amazon, using their VPN
gateway.

Here is a snippet of the logs. There is no index here; we dropped it on the
new node so that the copy would not fail. What's odd is that it copies all
the data and then reports the elapsed time 35 minutes later, which tells me
it's doing something, but I'm not sure what if there is no index on that
table. (There is a primary key, which maintains integrity, and we didn't
think we should drop that.) There are no other indexes, so the 35 minutes
or whatever is a mystery.


2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."adimpressions"
2016-09-11 22:39:39 PDT CONFIG remoteWorkerThread_1: 76955497834 bytes copied for table "torque"."adimpressions"
916499:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: 6121.393 seconds to copy table "torque"."impressions"
916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table "torque"."impressions_archive"
916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions_archive"
916811:2016-09-11 23:14:25 PDT ERROR  remoteWorkerThread_1: "select "_cls".copyFields(237);"
916907:2016-09-11 23:14:25 PDT WARN   remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds
917014:2016-09-11 23:14:25 PDT INFO   cleanupThread: 7606.655 seconds for cleanupEvent()

For this run, I added keep-alives by the following method (the timing and
results are the same without them; set 2 fails with error 237).

I added the following to both slon commands, on the origin and the new node:

tcp_keepalive_idle 300
tcp_keepalive_count 5
tcp_keepalive_interval 300
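
In case it helps anyone reproduce this, here is roughly what that looks
like in the slon runtime config file each daemon loads with -f. This is a
sketch, assuming the postgresql.conf-style key=value format that slon's
config file uses; the file name is made up:

    # slon-node.conf (hypothetical name), loaded via: slon -f slon-node.conf ...
    # Probe an idle connection after 300s of silence...
    tcp_keepalive_idle=300
    # ...give up after 5 unanswered probes...
    tcp_keepalive_count=5
    # ...sent 300s apart.
    tcp_keepalive_interval=300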

Now, I'm not entirely sure how this is supposed to work, or whether I tuned
it right. It obviously fails at the 30-minute mark, and the probes above add
up to 25 minutes; however, the servers never lose connectivity (I have a
ping running, which is not quite the same thing, but it shows zero packet
loss over the 2+ hours these replication attempts take). So maybe someone
smarter than me can advise how I should tune the keep-alives, if that's
what is happening.
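
If I understand standard TCP keepalive semantics correctly (my assumption
being that slon's settings map straight onto the kernel's per-socket
TCP_KEEPIDLE / TCP_KEEPCNT / TCP_KEEPINTVL options), the full detection
window with my values would be:

    # idle time before first probe + (probe count * probe interval)
    #   300 + (5 * 300) = 1800 seconds = 30 minutes

which happens to line up with the 30-minute failure mark, so maybe the
keep-alives are only detecting the drop rather than preventing it.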

I thought it would only use the keep-alives if it felt the partner was no
longer there, but since I know the pings show there are no connectivity
issues, I'm at a loss. AGAIN :)
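
For completeness, libpq itself also accepts keepalive parameters in the
conninfo string, so another thing I may try is putting them directly on
slon's connection string. A sketch only; the cluster name cls is taken from
the _cls schema in the logs, the host/dbname/user values are made up, and I
haven't verified this in our setup:

    slon cls 'host=newnode.example.com dbname=torque user=slony
              keepalives=1 keepalives_idle=300
              keepalives_interval=300 keepalives_count=5'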

Thanks for the assist

Tory