[Slony1-general] Slony stops replicating during nightly periodic + small patch

Tue Oct 12 17:57:10 PDT 2004

Hi,

First of all, many thanks for the great work on slony!

I use Slony 1.0.2 to replicate between two PostgreSQL 7.4.3 databases running
on FreeBSD 5.2.1-p9, and slon stops replicating almost every night (with a
couple of exceptions) during the nightly periodic run that does the backups,
vacuuming, etc. I use the standard 502.pgsql script that comes with the
postgresql port on FreeBSD (I am not sure whether it is part of the port or
of the original PostgreSQL source tree), which basically does a pg_dump and
a vacuum analyze.

Every night, I get this on stdout from slon:
ERROR  remoteListenThread_1: timeout for event selection
And this on stderr:
sched_mainloop: select(): Bad file descriptor

Setting the debug level to 4 does not give much more information; it only
reports, after the timeout, that the remoteListenThread is done.

While trying to understand the scheduling mechanism, I found this small
issue: in scheduler.c, the main loop first makes a temporary copy of the
fdsets for select(), and only afterwards runs the checks that remove FDs
which are no longer needed from the global fdsets. The temporary copy can
therefore still contain a descriptor that has just been removed. I believe
this is an oversight and the reason for the select() error, which in turn
sets sched_status to an error value, makes sched_msleep return an error,
and causes the remote listener thread to stop.
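
To illustrate the failure mode outside of slon, here is a minimal C sketch.
It is not code from scheduler.c: demo_fdset_read, demo_numfd and the two
functions are hypothetical stand-ins, and it assumes the removed descriptor
has already been closed by the time select() runs. With the copy taken
before the removal, select() should fail with EBADF; with the copy taken
after the removal, it should return normally.

```c
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical stand-ins for sched_fdset_read and sched_numfd. */
static fd_set demo_fdset_read;
static int    demo_numfd;

/* Copy the global set into a private one, like the loop moved by the patch. */
static void demo_copy_fdset(fd_set *dst)
{
    int i;

    FD_ZERO(dst);
    for (i = 0; i < demo_numfd; i++)
        if (FD_ISSET(i, &demo_fdset_read))
            FD_SET(i, dst);
}

/*
 * Register a pipe's read end, then close and unregister it either after
 * the copy is taken (copy_first != 0, the old order) or before it
 * (copy_first == 0, the patched order).
 * Returns 0 if select() succeeds, the errno value if it fails, or -1 if
 * the pipe could not be set up.
 */
int demo_select_errno(int copy_first)
{
    int            fds[2];
    fd_set         rfds;
    struct timeval tv = {0, 0};     /* poll, do not block */
    int            rc, err;

    if (pipe(fds) < 0)
        return -1;
    if (fds[0] >= FD_SETSIZE)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    FD_ZERO(&demo_fdset_read);
    FD_SET(fds[0], &demo_fdset_read);
    demo_numfd = fds[0] + 1;

    if (copy_first)
        demo_copy_fdset(&rfds);     /* old order: copy taken too early */

    /* The connection goes away: close it and drop it from the global set. */
    close(fds[0]);
    FD_CLR(fds[0], &demo_fdset_read);

    if (!copy_first)
        demo_copy_fdset(&rfds);     /* patched order: copy taken last */

    rc = select(demo_numfd, &rfds, NULL, NULL, &tv);
    err = (rc < 0) ? errno : 0;

    close(fds[1]);
    return err;
}
```

Calling demo_select_errno(1) exercises the old ordering and
demo_select_errno(0) the patched one; comparing the two return values shows
why the position of the copy matters.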

I moved the copy further down (to just before the select()), and last night
slon did not stop replicating even though it logged several of the "timeout
for event selection" errors. I should probably wait for a couple more
periodic runs before claiming victory, but I believe the patch should at the
very least not cause any problems, and should solve a few, so here it is
(including a couple of typo fixes):

%diff -u scheduler.c.orig scheduler.c

--- scheduler.c.orig    Mon Oct 11 17:00:30 2004
+++ scheduler.c Tue Oct 12 18:54:09 2004
@@ -452,21 +452,8 @@
                 struct timeval  timeout;

                 /*
-                * Make copies of the file descriptor sets for select(2)
-                */
-               FD_ZERO(&rfds);
-               FD_ZERO(&wfds);
-               for (i = 0; i < sched_numfd; i++)
-               {
-                       if (FD_ISSET(i, &sched_fdset_read))
-                               FD_SET(i, &rfds);
-                       if (FD_ISSET(i, &sched_fdset_write))
-                               FD_SET(i, &wfds);
-               }
-
-               /*
                  * Check if any of the connections in the wait queue
-                * have reached there timeout. While doing so, we also
+                * have reached their timeout. While doing so, we also
                  * remember the closest timeout in the future.
                  */
                 tv = NULL;
@@ -560,6 +547,19 @@
                 }

                 /*
+                * Make copies of the file descriptor sets for select(2)
+                */
+               FD_ZERO(&rfds);
+               FD_ZERO(&wfds);
+               for (i = 0; i < sched_numfd; i++)
+               {
+                       if (FD_ISSET(i, &sched_fdset_read))
+                               FD_SET(i, &rfds);
+                       if (FD_ISSET(i, &sched_fdset_write))
+                               FD_SET(i, &wfds);
+               }
+
+               /*
                  * Do the select(2) while unlocking the master lock.
                  */
                 pthread_mutex_unlock(&sched_master_lock);
@@ -776,7 +776,7 @@


  /* ----------
- * sched_add_fdset
+ * sched_remove_fdset
   *
   *     Remove a file descriptor from one of the global scheduler sets and
   *     adjust sched_numfd accordingly.

Hope that helps,

Jacques.