[Slony1-general] CESTERROR remoteListenThread_1: timeout (300 s) for event selection

Wed May 25 08:01:32 PDT 2011

On Wed, May 25, 2011 at 7:43 AM, Ger Timmens <Ger.Timmens at adyen.com> wrote:
> We are replicating a 500 GB database from PostgreSQL 8.3 to
> PostgreSQL 9.0 using slony1-2.0.6.
>
> We got the following error in our slon logs during the copy set of
> one of the bigger tables:
>
> CESTERROR  remoteListenThread_1: timeout (300 s) for event selection
>
> The documentation:
>
>    ERROR: remoteListenThread_%d: timeout for event selection
>
>    This means that the listener thread (src/slon/remote_listener.c)
> timed out when trying to determine what events were outstanding for it.
>
>    This could occur because network connections broke, in which
> case restarting the slon might help.
>
>    Alternatively, this might occur because the slon for this node
> has been broken for a long time, and there are an enormous number of
> entries in sl_event on this or other nodes for the node to work
> through, and it is taking more than slon_conf_remote_listen_timeout
> seconds to run the query. In older versions of Slony-I, that
> configuration parameter did not exist; the timeout was fixed at 300
> seconds. In newer versions, you might increase that timeout in the
> slon config file to a larger value so that it can continue to
> completion. And then investigate why nobody was monitoring things
> such that replication broke for such a long time...
>
> Replication seems to continue fine after this error.
> Is it safe to continue?
> Or should we start from scratch?
> If so, what do we have to do to prevent this error from happening again?

Well, the documentation indicates that this error tends to come up for
two reasons:

a) Because there was some sort of network glitch, or
b) Because some kind of misconfiguration left the cluster behind by
some stupendous number of events.

I think you encountered a), since the error didn't persist; with b),
I'd expect the event query to keep timing out on each retry.

As such, I'd chalk it up to a network glitch. There may be some value
in investigating why your network is dropping connections, but that
isn't particularly important to replication itself.  You shouldn't
have any ongoing problem.
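
If you want to confirm that the error really was a one-off, you can
scan the slon log for repeats. The following is only a rough sketch in
Python, not anything that ships with Slony-I: the function names and
the "quiet_lines" threshold are made up for illustration, and the
pattern is based solely on the message text quoted above.

```python
import re

# Matches the message text quoted in this thread, e.g.
#   "ERROR  remoteListenThread_1: timeout (300 s) for event selection"
# Whatever precedes "ERROR" on the line (timestamp, time zone) is not parsed.
TIMEOUT_RE = re.compile(
    r"ERROR\s+remoteListenThread_(\d+): timeout \((\d+) s\) for event selection"
)


def find_listen_timeouts(lines):
    """Return (line_number, node_id, timeout_seconds) for each matching line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        m = TIMEOUT_RE.search(line)
        if m:
            hits.append((lineno, int(m.group(1)), int(m.group(2))))
    return hits


def looks_transient(lines, quiet_lines=100):
    """Hypothetical heuristic, not a Slony-I rule.

    Treat the timeouts as transient if at least `quiet_lines` further
    log lines were written after the last timeout. A log with no
    timeouts at all also counts as fine.
    """
    hits = find_listen_timeouts(lines)
    if not hits:
        return True
    last_lineno = hits[-1][0]
    return len(lines) - last_lineno >= quiet_lines
```

Feed it the log as a list of lines, for example
looks_transient(open(path).read().splitlines()), where "path" is
wherever your slon writes its log. If the timeouts keep recurring
right up to the end of the log, case b) becomes the more likely
explanation and the backlog in sl_event is worth a look.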