Dan Goodliffe dan at randomdan.homeip.net
Wed May 25 08:37:05 PDT 2011
I've never had an actual problem caused by this. I regularly hibernate 
my laptop (which I replicate live data to), and when it wakes up again 
slon logs a network error, reconnects and carries on. Very reliable and 
safe in my experience.
If you're not doing something like that and it's over an "always on" 
connection, it might be indicative of some other problem, but slony1 
won't actually mind.
One thing to note, however: I had trouble with 2.0.6 *not* reconnecting 
and carrying on; I had to restart the slony1 service manually. 2.0.5 
worked fine, and the 2.1 beta I'm using now also works fine.
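
As for the timeout itself (the slon_conf_remote_listen_timeout mentioned 
in the docs quoted below), you can raise it per slon rather than living 
with the 300 second default. A minimal sketch, assuming the config-file 
spelling is remote_listen_timeout (check the config reference for your 
exact Slony-I version) and with an arbitrary example value:

    # slon.conf for the subscriber's slon
    # default is 300 seconds; raise it if the event selection query
    # legitimately needs longer (e.g. a large sl_event backlog)
    remote_listen_timeout=900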

On 25/05/11 16:01, Christopher Browne wrote:
> On Wed, May 25, 2011 at 7:43 AM, Ger Timmens <Ger.Timmens at adyen.com> wrote:
>> We are replicating a 500Gb database from postgresql 8.3 to
>> postgresql 9.0 using slony1-2.0.6.
>>
>> We got the following error in our slon logs during the copy set of
>> one of the bigger tables:
>>
>> CEST ERROR remoteListenThread_1: timeout (300 s) for event selection
>>
>> The documentation:
>>
>>     ERROR: remoteListenThread_%d: timeout for event selection
>>
>>     This means that the listener thread (src/slon/remote_listener.c)
>> timed out when trying to determine what events were outstanding for it.
>>
>>     This could occur because network connections broke, in which
>> case restarting the slon might help.
>>
>>     Alternatively, this might occur because the slon for this node
>> has been broken for a long time, and there are an enormous number of
>> entries in sl_event on this or other nodes for the node to work
>> through, and it is taking more than slon_conf_remote_listen_timeout
>> seconds to run the query. In older versions of Slony-I, that
>> configuration parameter did not exist; the timeout was fixed at 300
>> seconds. In newer versions, you might increase that timeout in the
>> slon config file to a larger value so that it can continue to
>> completion. And then investigate why nobody was monitoring things
>> such that replication broke for such a long time...
>>
>> Replication seems to continue fine after this error.
>> Is it safe to continue?
>> Or should we start from scratch?
>> If so, what do we have to do to prevent this error from happening again?
> Well, the documentation indicates that this error tends to come up for
> two reasons:
>
> a) Because there was some sort of network glitch, or
> b) Because some kind of misconfiguration left the cluster behind by
> some stupendous number of events.
>
> I think you encountered a), since the error didn't persist.
>
> As such, I'd chalk it up to "network glitch," and while there may be
> some value in doing a network investigation as to why such things
> might be happening to you, it's not particularly important to
> replication itself.  You shouldn't have any ongoing problem.
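
If you ever want to rule out case b) above, a quick look at the event 
backlog on each node is enough. Something like this (the _mycluster 
schema name is just a placeholder; use "_" plus your cluster name):

    -- run against each node in the cluster
    SELECT ev_origin, count(*) AS pending_events, max(ev_seqno) AS latest
      FROM _mycluster.sl_event
     GROUP BY ev_origin
     ORDER BY ev_origin;

A handful of events per origin is normal; an enormous count would point 
at the backlog scenario the documentation describes.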

