Sridevi R write2sridevi at gmail.com
Sat Jun 15 01:46:03 PDT 2013
Jan,

You are right. Tweaking the TCP keepalive parameters helped.

Now slon.conf contains:

tcp_keepalive=true
tcp_keepalive_interval=45
tcp_keepalive_count=20
tcp_keepalive_idle=30
cleanup_interval=30
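As a rough guide to what these keepalive values imply, here is a minimal Python sketch. It assumes Linux-style keepalive semantics: the first probe is sent after tcp_keepalive_idle seconds of silence, unanswered probes are repeated every tcp_keepalive_interval seconds, and the connection is dropped after tcp_keepalive_count unanswered probes. KeepaliveSettings and the two helper functions are hypothetical names used only for illustration; they are not part of slon.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveSettings:
    """Hypothetical container for the slon.conf values above."""
    idle: int      # tcp_keepalive_idle, seconds
    interval: int  # tcp_keepalive_interval, seconds
    count: int     # tcp_keepalive_count, number of probes


def longest_quiet_period(s: KeepaliveSettings) -> int:
    """Approximate longest silence on a healthy but otherwise idle link.

    Each probe is answered, so the idle timer restarts and the next probe
    follows roughly `idle` seconds later. This is the figure that has to
    stay below the firewall's idle timeout.
    """
    return s.idle


def dead_peer_detection_time(s: KeepaliveSettings) -> int:
    """Seconds until a connection whose peer stopped answering is dropped."""
    # The first probe goes out after `idle` seconds; the connection is
    # dropped `interval` seconds after the `count`-th unanswered probe.
    return s.idle + s.interval * s.count


if __name__ == "__main__":
    slon = KeepaliveSettings(idle=30, interval=45, count=20)
    print(longest_quiet_period(slon))
    print(dead_peer_detection_time(slon))
```

Under those assumptions, a working connection is never silent for much more than about 30 seconds, while a dead one is only reported after roughly 930 seconds (about 15.5 minutes).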

Thanks a lot for the timely response.

- Sridevi



On Thu, Jun 13, 2013 at 11:46 PM, Jan Wieck <JanWieck at yahoo.com> wrote:

> On 06/13/13 06:25, Sridevi R wrote:
> > Hello Jan,
> >
> > The Master and Slave DBs talk through a firewall.
> > VIP IPs and SNAT IPs are used in pg_hba.conf.
> >
> > The corresponding messages in the postgres server log:
> >
> > 2013-06-13 09:46:21.224 GMT,,,6630,"10.4.2.2:42031",51b994ed.19e6,1,"",2013-06-13 09:46:21 GMT,,0,LOG,08P01,"incomplete startup packet",,,,,,,,,""
> > 2013-06-13 09:57:38.596 GMT,"postgres","db01",6634,"<ip address printed here>:53924",51b994f7.19ea,1,"idle",2013-06-13 09:46:31 GMT,28/0,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,,,,,"slon.node_1_listen"
> > 2013-06-13 09:57:38.596 GMT,"postgres","db01",6634,"<ip address printed here>:53924",51b994f7.19ea,2,"idle",2013-06-13 09:46:31 GMT,28/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,"slon.node_1_listen"
> > 2013-06-13 09:57:38.607 GMT,"postgres","db01",6637,"<ip address printed here>:53926",51b994f9.19ed,1,"idle",2013-06-13 09:46:33 GMT,32/0,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,,,,,"slon.subscriber_1_provider_1"
> > 2013-06-13 09:57:38.607 GMT,"postgres","db01",6637,"<ip address printed here>:53926",51b994f9.19ed,2,"idle",2013-06-13 09:46:33 GMT,32/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,"slon.subscriber_1_provider_1"
> > 2013-06-13 09:57:38.608 GMT,"postgres","db01",6635,"<ip address printed here>:53925",51b994f7.19eb,1,"idle",2013-06-13 09:46:31 GMT,31/0,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,,,,,"slon.node_1_listen"
> > 2013-06-13 09:57:38.608 GMT,"postgres","db01",6635,"<ip address printed here>:53925",51b994f7.19eb,2,"idle",2013-06-13 09:46:31 GMT,31/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,"slon.node_1_listen"
> >
> > The client slon log contains:
> > 2013-06-13 09:57:38 GMT FATAL  cleanupThread: "begin;lock table "_xx_cluster".sl_config_lock;select "_xx_cluster".cleanupEvent('10 minutes'::interval);commit;" - server closed the connection unexpectedly
> >         This probably means the server terminated abnormally
> >         before or while processing the request.
>
> This could very well be an overly eager firewall dropping idle
> connections. Have you tried enabling TCP keepalive options that kick
> in after something like 30 seconds? If not, enable them on both the PG
> server and the Slony side. That usually prevents these firewall issues.
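On the PostgreSQL server side the corresponding settings are tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count in postgresql.conf; on the slon side they are the tcp_keepalive_* options shown at the top of this thread. Underneath, both ends turn on the same per-socket TCP options. A minimal Python sketch of those options, assuming Linux (TCP_KEEPIDLE and friends are not available on every platform); enable_keepalive is a hypothetical helper and the values are illustrative only.

```python
import socket

# Illustrative values only, matching the slon.conf quoted above.
IDLE, INTERVAL, COUNT = 30, 45, 20


def enable_keepalive(sock: socket.socket, idle: int, interval: int, count: int) -> None:
    """Turn on TCP keepalive for one socket (Linux option names)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of silence before the first probe.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between unanswered probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the connection is dropped.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s, IDLE, INTERVAL, COUNT)
        print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
```

The probes and their ACKs are ordinary packets, so as long as the idle value is shorter than the firewall's idle timeout, the firewall should keep seeing traffic on the connection.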
>
>
> Jan
>
>
> >
> >
> > Thanks,
> > Sridevi
> >
> >
> >
> >
> >
> > On Thu, Jun 13, 2013 at 12:02 AM, Jan Wieck <JanWieck at yahoo.com> wrote:
> >
> >     On 06/12/13 10:17, Sridevi R wrote:
> >     > Jan,
> >     >
> >     > Thanks for the reply.
> >     >
> >     > The only errors in the slon log are cleanupThread failures. The
> >     > child process restarts right after each cleanupThread failure.
> >     > This occurs approximately every 10 minutes, since cleanup_interval
> >     > is set to 10 minutes.
> >     >
> >     > Here is a sample from the log again:
> >     >
> >     > 2013-06-06 14:23:27 GMT FATAL  cleanupThread: "begin;lock table "_xx_cluster".sl_config_lock;select "_xx_cluster".cleanupEvent('10 minutes'::interval);commit;" - server closed the connection unexpectedly
> >     >     This probably means the server terminated abnormally
> >     >     before or while processing the request.
> >     > 2013-06-06 14:23:27 GMT CONFIG slon: child terminated signal: 9; pid: 16135, current worker pid: 16135
> >     > 2013-06-06 14:23:27 GMT CONFIG slon: restart of worker in 10 seconds
> >
> >     "server closed the connection unexpectedly" ...
> >
> >     Is this connection by any chance through some firewall or NAT gateway
> >     that will drop idle connections?
> >
> >     What are the corresponding postmaster server log entries? Since Slony
> >     reports an unexpected connection drop from the server, the server must
> >     have some message in its log too, because the client never sent the
> >     'X' (Terminate) libpq protocol message.
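For reference, the 'X' message mentioned here is the frontend Terminate message of the PostgreSQL wire protocol (version 3.0): a single type byte 'X' followed by a 4-byte big-endian length whose value is 4, because the length counts itself but not the type byte. A client that shuts down cleanly sends it before closing the socket; when the server instead sees the connection vanish, it logs entries like the "unexpected EOF on client connection" lines in the 06/13 message. This Python sketch only illustrates the byte layout; terminate_message is a hypothetical helper, not how libpq is implemented.

```python
import struct


def terminate_message() -> bytes:
    """Build the frontend Terminate message: Byte1('X') then Int32 length."""
    # "!i" packs a signed 32-bit integer in network (big-endian) order.
    # The length field covers its own 4 bytes and nothing else.
    return b"X" + struct.pack("!i", 4)


if __name__ == "__main__":
    print(terminate_message())  # b'X\x00\x00\x00\x04'
```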
> >
> >
> >     Jan
> >
> >
> >     >
> >     > Thanks ,
> >     > Sridevi
> >     >
> >     >
> >     > On Wed, Jun 12, 2013 at 7:33 PM, Jan Wieck <JanWieck at yahoo.com> wrote:
> >     >
> >     >     On 06/12/13 07:14, Sridevi R wrote:
> >     >     > Hello,
> >     >     >
> >     >     > The slony logs are consistently posting this error:
> >     >     >
> >     >     > 2013-06-12 10:01:05 GMT FATAL  cleanupThread: "begin;lock table "_xx_cluster".sl_config_lock;select "_xx_cluster".cleanupEvent('10 minutes'::interval);commit;" - server closed the connection unexpectedly
> >     >     > 2013-06-12 10:12:24 GMT FATAL  cleanupThread: "begin;lock table "_xx_cluster".sl_config_lock;select "_xx_cluster".cleanupEvent('10 minutes'::interval);commit;" - server closed the connection unexpectedly
> >     >     >
> >     >     > I checked and found that the sl_confirm table is not cleaned
> >     >     > up; the cleanup event never succeeds.
> >     >     > Additionally, the child process terminates and restarts after
> >     >     > each such cleanup failure.
> >     >     >
> >     >     > 2013-06-11 11:20:04 GMT CONFIG slon: child terminated signal: 9; pid: 20172, current worker pid: 20172
> >     >     > 2013-06-11 11:20:04 GMT CONFIG slon: restart of worker in 10 seconds
> >     >     >
> >     >     > When the cleanup is run manually at the psql prompt, it runs to
> >     >     > completion without any issues and cleans up the sl_event and
> >     >     > sl_confirm tables:
> >     >     > "begin;lock table "_xx_cluster".sl_config_lock;select "_xx_cluster".cleanupEvent('10 minutes'::interval);commit;"
> >     >     >
> >     >     > Slon version: 2.1.2
> >     >     >
> >     >     > Any help/insight would be greatly appreciated.
> >     >
> >     >     Slon kills its worker(s) with signal 9 (SIGKILL) when it needs to
> >     >     restart, for example when there are errors in event processing or
> >     >     when it receives certain signals. Are there any other errors in the
> >     >     slon log, or is something on the machine sending signals to slon?
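The "child terminated signal: 9" lines are the parent slon process reporting that its worker died from SIGKILL. The same convention is visible from Python on POSIX systems: a child killed by a signal reports a negative return code equal to minus the signal number. A small POSIX-only sketch; the sleeping child is a hypothetical stand-in for a slon worker, and kill_and_report is an illustrative name.

```python
import signal
import subprocess
import sys


def kill_and_report() -> int:
    """Start a child, SIGKILL it, and return the exit status Python reports."""
    # Hypothetical stand-in for a slon worker: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGKILL)  # equivalent to child.kill() on POSIX
    return child.wait()  # negative value: terminated by that signal number


if __name__ == "__main__":
    status = kill_and_report()
    print(status, -status == signal.SIGKILL)  # -9 True
```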
> >     >
> >     >
> >     >     Jan
> >     >
> >     >     >
> >     >     > Thanks,
> >     >     > Sridevi
> >     >     >
> >     >     >
> >     >     >
> >     >     > _______________________________________________
> >     >     > Slony1-general mailing list
> >     >     > Slony1-general at lists.slony.info
> >     >     > http://lists.slony.info/mailman/listinfo/slony1-general
> >     >     >
> >     >
> >     >
> >     >     --
> >     >     Anyone who trades liberty for security deserves neither
> >     >     liberty nor security. -- Benjamin Franklin
> >     >
> >     >
> >
> >
> >
> >
> >
> >
> >
>
>
>