[Slony1-general] Move set terminates slon

Mon Aug 15 15:53:52 PDT 2005

John Sidney-Woollett wrote:

> Can anyone explain why slon on (the master) node #1 stopped when a
> MOVE SET command was issued on that node - see the FATAL error notice
> below. But when slon on node 1 was restarted, it processed the as yet
> unprocessed MOVE SET correctly.
>
> The slonik script below was executed on fs01b, node #1, and this is
> the server where the slon process died. The slon process on db01a,
> node #2 stayed up fine during all the move set operations.
>
> No applications were running against either database during the switch
> over. But I did have one psql session open against each db to check
> that the moves worked OK by issuing SQL statements for the appropriate
> tables after the move.
>
> We had 6 sets to move and only the first move (set #6) didn't
> terminate the slon process. However, all the move sets seem to have
> worked OK.
>
> I checked the _bpreplicate2.sl_set table (on both nodes), and the
> set_origin is now 2, and all the tables are unlocked on node #2. All
> sequences and tables seem to be being replicated correctly (now from
> node #2 to node #1).
>
> I'm using slony 1.1 with postgres 7.4.6
>
> Any ideas?

Yes, I have an idea...

The error message comes from local_listen.c; here is the code
fragment that generates it...

                if (PQntuples(res2) != 1)
                {
                    slon_log(SLON_FATAL, "localListenThread: MOVE_SET "
                         "but no provider found for set %d\n",
                         set_id);
                    dstring_free(&query2);
                    PQclear(res2);
                    slon_abort();
                }

That != 1 struck me as suspicious...  If the code were checking for
"non-existence," I'd expect it to test for PQntuples(res2) being zero.

If the query returns more than one row, that code path is taken as
well, even though the message says no provider was found.  Even before
looking at the query, that seems wrong.

Stepping backwards, the query was...

                slon_mkquery(&query2,
                         "select sub_provider from %s.sl_subscribe "
                         "    where sub_receiver = %d",
                         rtcfg_namespace, rtcfg_nodeid);
                res2 = PQexec(dbconn, dstring_data(&query2));

The query filters only on sub_receiver, not on the set.  In your case,
when the *first* MOVE SET takes place, node 1 becomes the receiver for
that one set, so exactly one row is found, PQntuples(res2) returns 1,
and all appears OK.

When the subsequent MOVE SETs are performed, the query finds multiple
rows for that receiver (one per set moved so far), PQntuples(res2) is
greater than 1, and the SLON_FATAL error is raised each time.
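
To make the row counts concrete, here is a small sketch in plain C
that models sl_subscribe as an in-memory array instead of querying a
database.  The struct and both function names are made up for
illustration, and I am assuming the set id column in sl_subscribe is
called sub_set.  It is not code from slon.

```c
#include <stdio.h>

/* Hypothetical stand-in for one row of sl_subscribe (illustration only). */
struct subscription {
    int sub_set;        /* assumed name of the set id column */
    int sub_provider;
    int sub_receiver;
};

/* Rows the existing query would match: filtered on receiver only. */
static int count_by_receiver(const struct subscription *rows, int nrows,
                             int receiver)
{
    int count = 0;
    for (int i = 0; i < nrows; i++) {
        if (rows[i].sub_receiver == receiver)
            count++;
    }
    return count;
}

/* Rows matched when the set id is part of the filter as well. */
static int count_by_receiver_and_set(const struct subscription *rows,
                                     int nrows, int receiver, int set_id)
{
    int count = 0;
    for (int i = 0; i < nrows; i++) {
        if (rows[i].sub_receiver == receiver && rows[i].sub_set == set_id)
            count++;
    }
    return count;
}
```

With node 1 as the receiver, count_by_receiver sees one row after the
first move and two after the second, which is the point where the
!= 1 test trips.  count_by_receiver_and_set keeps returning one row
for each set that has been moved.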

It ought to be simple enough to add the set_id into the query, which
would resolve the issue.  I'll hold off on that until we can get the new
testing framework checked in, so that I can test the fix with it.
(Hint, hint, Darcy!)
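
Roughly, the revised query text might look like the sketch below.  It
uses snprintf in place of slon_mkquery so that it compiles on its own.
The function name build_provider_query is hypothetical, and sub_set is
my assumption for the name of the set id column.  This is untested
against a real cluster and is not the committed fix.

```c
#include <stdio.h>

/*
 * Build the provider lookup with the set id added to the WHERE clause,
 * so that at most one row per (receiver, set) pair should come back.
 * Returns what snprintf returns: the length the full query needs, so
 * a result >= buflen means the buffer was too small.
 */
static int build_provider_query(char *buf, size_t buflen,
                                const char *namespace_name,
                                int node_id, int set_id)
{
    return snprintf(buf, buflen,
                    "select sub_provider from %s.sl_subscribe "
                    "    where sub_receiver = %d and sub_set = %d",
                    namespace_name, node_id, set_id);
}
```

With the set id in the WHERE clause, the existing PQntuples(res2) != 1
test would again mean what its error message says.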

The effect of this bug is that the slon's attempt to reconfigure
itself after the event fails.  Restarting the slon is the right
answer, and will work fine.  Your watchdog process is your friend in
this case :-).