bugzilla-daemon at main.slony.info
Tue May 18 12:44:40 PDT 2010
http://www.slony.info/bugzilla/show_bug.cgi?id=126

           Summary: slon sometimes does not recover from a network outage
           Product: Slony-I
           Version: 1.2
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: normal
          Priority: low
         Component: slon
        AssignedTo: slony1-bugs at lists.slony.info
        ReportedBy: ssinger at ca.afilias.info
                CC: slony1-bugs at lists.slony.info
   Estimated Hours: 0.0


We've received a report of slon not recovering properly from a network outage.

It appears that the remote listener thread (8431) encountered a network error
while the network was down. No network errors were observed for the remote
worker threads. After the error, the remote listener for 8431 apparently
continued to queue events (but no logs are available). Replication started to
fall behind and did not proceed after the network was restored.

Restarting slon made replication work again.


--


My theory is that we were waiting on a socket read() inside libpq when the
network died. Since we were not trying to send an event, no packets were
generated to notify libpq that the network connection had died.

Setting KEEPALIVE on the connections to postgres should address this.  We don't
appear to be doing that currently.
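
As a rough illustration of the KEEPALIVE idea, the sketch below enables
SO_KEEPALIVE on an already-connected socket. It is not taken from the slon
sources: the helper name enable_keepalive() is hypothetical, and it assumes
the descriptor for a libpq connection would be obtained with PQsocket(conn).

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of slon or libpq): ask the kernel to send
 * TCP keepalive probes on an idle connection, so that a blocked read()
 * eventually fails when the peer has become unreachable.
 *
 * For a libpq connection, fd is assumed to come from PQsocket(conn).
 * Returns 0 on success, -1 on failure.
 */
static int
enable_keepalive(int fd)
{
    int on = 1;

    if (fd < 0)
        return -1;              /* e.g. PQsocket() on a bad connection */

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```

How quickly a dead connection is noticed still depends on the operating
system's keepalive timers, which often default to an idle time of hours, so
those may need tuning as well.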



--


2010-05-16 20:59:35 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:35 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:35 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112865 received by 8344
2010-05-16 20:59:35 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112865 received by 8346
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8346: queue event 8346,157585 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8346: UNLISTEN
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8346: Received event 8346,157585 SYNC
2010-05-16 20:59:38 UTC DEBUG2 calc sync size - last time: 1 last length: 10001 ideal: 5 proposed size: 3
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8346: SYNC 157585 processing
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8346: no sets need syncing for this event
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8344: queue event 8344,148724 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8344: queue event 8346,157585 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteWorker_event: event 8346,157585 ignored - duplicate
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8344: UNLISTEN
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: Received event 8344,148724 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: SYNC 148724 processing
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: no sets need syncing for this event
2010-05-16 20:59:38 UTC ERROR  remoteListenThread_8341: "select con_origin, con_received,     max(con_seqno) as con_seqno,   max(con_timestamp) as con_timestamp from "_oxrsin".sl_confirm where con_received <> 8394 group by con_origin, con_received"
 could not receive data from server: Connection timed out
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8346,157585 received by 8344
2010-05-16 20:59:42 UTC DEBUG2 syncThread: new sl_action_seq 1 - SYNC 112866
2010-05-16 20:59:45 UTC DEBUG2 localListenThread: Received event 8394,112866 SYNC
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8394,112866 received by 8344
2010-05-16 20:59:45 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8344,148724 received by 8346
2010-05-16 20:59:45 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112866 received by 8346
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8346: queue event 8346,157586 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8346: UNLISTEN
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8346: Received event 8346,157586 SYNC
2010-05-16 20:59:48 UTC DEBUG2 calc sync size - last time: 1 last length: 10002 ideal: 5 proposed size: 3
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8346: SYNC 157586 processing
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8346: no sets need syncing for this event
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8344: queue event 8344,148725 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8344: queue event 8346,157586 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteWorker_event: event 8346,157586 ignored - duplicate
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8344: UNLISTEN
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: Received event 8344,148725 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: SYNC 148725 processing
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: no sets need syncing for this event
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8346,157586 received by 8344
2010-05-16 20:59:49 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8344,148725 received by 8346
2010-05-16 20:59:52 UTC DEBUG2 syncThread: new sl_action_seq 1 - SYNC 112867
2010-05-16 20:59:55 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:55 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:55 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112867 received by 8344
2010-05-16 20:59:55 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8394,112867 received by 8346
