Ulas Albayrak ulas.albayrak at gmail.com
Mon Feb 27 02:58:56 PST 2012
Hi,

None of the tables in any of the sets are being replicated (there are 9 sets
in total, by the way). The most recent data in all of the tables is from
within half an hour of the same point in time.
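
In case it helps to quantify that, a rough way to look at the lag from the
provider side would be something like the query below (the _fleetcluster
schema name is the one from my logs; the exact sl_status columns may vary
between Slony versions, so treat this as a sketch):

    -- run on the provider node, in psql
    SELECT st_origin, st_received, st_lag_num_events, st_lag_time
    FROM _fleetcluster.sl_status;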

The sl_log_1 and sl_log_2 tables on the provider node are currently holding
around 10k rows between the two of them, and nothing in them is older than
1.5 hours. Does this mean that Slony thinks everything older than that has
already been replicated?
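
For reference, a quick way to look at what is sitting in those tables would
be something along these lines (assuming the standard sl_log layout in
2.0.x; adjust column names if they differ):

    -- on the provider node
    SELECT count(*) FROM _fleetcluster.sl_log_1;
    SELECT count(*) FROM _fleetcluster.sl_log_2;

    -- newest captured changes, to see what is still waiting to be applied
    SELECT log_origin, log_tableid, log_actionseq, log_cmdtype
    FROM _fleetcluster.sl_log_1
    ORDER BY log_actionseq DESC
    LIMIT 10;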

I don't know if it matters, but maybe I should mention that the two machines
in the cluster were previously in a cluster where the roles were reversed,
i.e. the subscriber used to be the provider and vice versa. After a system
failure and a subsequent failover I cleaned out the whole cluster (drop node
and uninstall node) before attempting to set it up again. I can't help
thinking that there may be some lingering configuration getting in the way?
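
If lingering configuration is a plausible cause, is there a quick way to rule
it out? My thought was to compare what both nodes believe about the cluster,
along these lines (table names as in Slony 2.0; this is only a sketch):

    -- on both nodes: any replication schemas left over besides _fleetcluster?
    SELECT nspname FROM pg_catalog.pg_namespace WHERE nspname ~ '^_';

    -- and each node's view of the current cluster configuration
    SELECT * FROM _fleetcluster.sl_node;
    SELECT * FROM _fleetcluster.sl_path;
    SELECT * FROM _fleetcluster.sl_subscribe;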

/Ulas

On Fri, Feb 24, 2012 at 5:44 PM, Steve Singer <ssinger at ca.afilias.info> wrote:
> On 12-02-24 10:41 AM, Ulas Albayrak wrote:
>
>
>
>> Hi,
>>
>> Sorry, seems I forgot to post versions:
>>
>> Slony: 2.0.7
>> PostgreSQL:  9.0.5
>>
>> I restarted the slon daemon with debug=4 on both nodes and this is what I
>> got:
>>
>> Origin:
>> 2012-02-24 16:20:50 CETDEBUG2 localListenThread: Received event
>> 1,5000002765 SYNC
>> 2012-02-24 16:20:50 CETDEBUG2 syncThread: new sl_action_seq 45211 -
>> SYNC 5000002766
>
>
> This shows that SYNC events are now being generated on the origin.
>
>
>
>> Subscriber:
>> 2012-02-24 16:28:53 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:28:53 CETDEBUG2 remoteWorkerThread_1: current local
>> log_status is 2
>> 2012-02-24 16:28:53 CETDEBUG3 remoteWorkerThread_1: activate helper 1
>> 2012-02-24 16:28:53 CETDEBUG4 remoteWorkerThread_1: waiting for log data
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: got work to do
>> 2012-02-24 16:28:53 CETDEBUG2 remoteWorkerThread_1_1: current remote
>> log_status = 1
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: allocate line
>> buffers
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: fetch from cursor
>> 2012-02-24 16:28:53 CETDEBUG1 remoteHelperThread_1_1: 0.002 seconds
>> delay for first row
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: fetched 0 log rows
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: return 50 unused
>> line buffers
>> 2012-02-24 16:28:53 CETDEBUG1 remoteHelperThread_1_1: 0.003 seconds
>> until close cursor
>> 2012-02-24 16:28:53 CETDEBUG1 remoteHelperThread_1_1: inserts=0
>> updates=0 deletes=0
>
>
> This shows that SYNC events are being processed on the slave and there is
> nothing to do.  I also notice that you have 8 replication sets.  Are changes
> not being replicated for all tables in all the sets, or only for some of the
> sets?
>
> sl_log_1 and sl_log_2 on the origin should have a record of the rows that
> need to be replicated.  If you insert a row into one of your tables, you
> should then see that data in sl_log_1 or sl_log_2.  Is this the case?
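>
> A quick way to run that test (just a sketch: the table and column names
> below are placeholders, and I'm assuming log_cmddata is plain text as it
> is in 2.0.x):
>
>     -- on the origin, touch any replicated table, e.g.:
>     INSERT INTO some_replicated_table (some_column) VALUES ('slony-log-test');
>
>     -- then look for the captured change in the log tables:
>     SELECT log_tableid, log_actionseq, log_cmdtype, log_cmddata
>       FROM _fleetcluster.sl_log_1
>      WHERE log_cmddata LIKE '%slony-log-test%'
>     UNION ALL
>     SELECT log_tableid, log_actionseq, log_cmdtype, log_cmddata
>       FROM _fleetcluster.sl_log_2
>      WHERE log_cmddata LIKE '%slony-log-test%';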
>
>
>
>
>> 2012-02-24 16:28:53 CETDEBUG1 remoteWorkerThread_1: sync_helper
>> timing:  pqexec (s/count)- provider 0.002/3 - subscriber 0.000/3
>> 2012-02-24 16:28:53 CETDEBUG1 remoteWorkerThread_1: sync_helper
>> timing:  large tuples 0.000/0
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: change helper
>> thread status
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: send DONE/ERROR
>> line to worker
>> 2012-02-24 16:28:53 CETDEBUG3 remoteHelperThread_1_1: waiting for
>> workgroup to finish
>> 2012-02-24 16:28:53 CETDEBUG3 remoteWorkerThread_1: helper 1 finished
>> 2012-02-24 16:28:53 CETDEBUG4 remoteWorkerThread_1: returning lines to
>> pool
>> 2012-02-24 16:28:53 CETDEBUG3 remoteWorkerThread_1: all helpers done.
>> 2012-02-24 16:28:53 CETDEBUG4 remoteWorkerThread_1: changing helper 1 to
>> IDLE
>> 2012-02-24 16:28:53 CETDEBUG2 remoteWorkerThread_1: cleanup
>> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: waiting for work
>> 2012-02-24 16:28:53 CETINFO   remoteWorkerThread_1: SYNC 5000002873
>> done in 0.014 seconds
>> 2012-02-24 16:28:53 CETDEBUG1 remoteWorkerThread_1: SYNC 5000002873
>> sync_event timing:  pqexec (s/count)- provider 0.001/1 - subsc
>> riber 0.008/1 - IUD 0.001/4
>> 2012-02-24 16:28:55 CETDEBUG2 syncThread: new sl_action_seq 1 - SYNC
>> 5000001345
>> 2012-02-24 16:28:59 CETDEBUG2 localListenThread: Received event
>> 2,5000001345 SYNC
>> 2012-02-24 16:29:05 CETDEBUG2 syncThread: new sl_action_seq 1 - SYNC
>> 5000001346
>> 2012-02-24 16:29:11 CETDEBUG2 localListenThread: Received event
>> 2,5000001346 SYNC
>> 2012-02-24 16:29:11 CETDEBUG2 remoteListenThread_1: queue event
>> 1,5000002874 SYNC
>> 2012-02-24 16:29:11 CETDEBUG2 remoteListenThread_1: queue event
>> 1,5000002875 SYNC
>> 2012-02-24 16:29:11 CETDEBUG2 remoteListenThread_1: queue event
>> 1,5000002876 SYNC
>> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: Received event #1
>> from 5000002874 type:SYNC
>> 2012-02-24 16:29:11 CETDEBUG1 calc sync size - last time: 1 last
>> length: 18011 ideal: 3 proposed size: 3
>> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: SYNC 5000002876
>> processing
>> 2012-02-24 16:29:11 CETDEBUG1 about to monitor_subscriber_query -
>> pulling big actionid list for 1
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 1 with
>> 9 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 2 with
>> 15 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 3 with
>> 4 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 4 with
>> 6 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 5 with
>> 3 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 6 with
>> 4 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 7 with
>> 3 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 8 with
>> 23 table(s) from provider 1
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: syncing set 9 with
>> 8 table(s) from provider 1
>
>
>
>
>> 2012-02-24 16:29:11 CETDEBUG4  ssy_action_list value:
>> 2012-02-24 16:29:11 CETDEBUG2  ssy_action_list length: 0
>> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: current local
>> log_status is 2
>> 2012-02-24 16:29:11 CETDEBUG3 remoteWorkerThread_1: activate helper 1
>> 2012-02-24 16:29:11 CETDEBUG4 remoteWorkerThread_1: waiting for log data
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: got work to do
>> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1_1: current remote
>> log_status = 1
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: allocate line
>> buffers
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: fetch from cursor
>> 2012-02-24 16:29:11 CETDEBUG1 remoteHelperThread_1_1: 0.002 seconds
>> delay for first row
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: fetched 0 log rows
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: return 50 unused
>> line buffers
>> 2012-02-24 16:29:11 CETDEBUG1 remoteHelperThread_1_1: 0.002 seconds
>> until close cursor
>> 2012-02-24 16:29:11 CETDEBUG1 remoteHelperThread_1_1: inserts=0
>> updates=0 deletes=0
>> 2012-02-24 16:29:11 CETDEBUG1 remoteWorkerThread_1: sync_helper
>> timing:  pqexec (s/count)- provider 0.002/3 - subscriber 0.000/3
>> 2012-02-24 16:29:11 CETDEBUG1 remoteWorkerThread_1: sync_helper
>> timing:  large tuples 0.000/0
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: change helper
>> thread status
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: send DONE/ERROR
>> line to worker
>> 2012-02-24 16:29:11 CETDEBUG3 remoteHelperThread_1_1: waiting for
>> workgroup to finish
>> 2012-02-24 16:29:11 CETDEBUG3 remoteWorkerThread_1: helper 1 finished
>> 2012-02-24 16:29:11 CETDEBUG4 remoteWorkerThread_1: returning lines to
>> pool
>> 2012-02-24 16:29:11 CETDEBUG3 remoteWorkerThread_1: all helpers done.
>> 2012-02-24 16:29:11 CETDEBUG4 remoteWorkerThread_1: changing helper 1 to
>> IDLE
>> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: cleanup
>> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: waiting for work
>> 2012-02-24 16:29:11 CETINFO   remoteWorkerThread_1: SYNC 5000002876
>> done in 0.014 seconds
>> 2012-02-24 16:29:11 CETDEBUG1 remoteWorkerThread_1: SYNC 5000002876
>> sync_event timing:  pqexec (s/count)- provider 0.001/1 - subsc
>> riber 0.008/1 - IUD 0.001/4
>> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: forward confirm
>> 2,5000001346 received by 1
>>
>> New SYNC events are being generated. I can find several rows in the
>> origin's sl_event table with ev_origin=$originid. None of them, however
>> (including rows with ev_origin=$subscriberid), are older than about 15
>> minutes, even though the most recent data in the subscriber db is several
>> hours old.
>>
>> I'm not really sure how to interpret the data in the pg_locks view, but
>> neither node (origin nor subscriber) has any rows where the "granted"
>> column isn't set to TRUE, if that means anything?
>>
>> /Ulas
>>
>>
>> On Fri, Feb 24, 2012 at 3:34 PM, Steve Singer <ssinger at ca.afilias.info>
>> wrote:
>>>
>>> On 12-02-24 08:21 AM, Ulas Albayrak wrote:
>>>
>>> You didn't say what version of slony you are using with which version of
>>> postgresql.
>>>
>>> I don't see anything in the logs you posted about the slon for the origin
>>> node generating sync events.  At DEBUG2 or higher (at least ons some
>>> versions of slony) you should be getting "syncThread: new sl_action_seq
>>> %s "
>>> type messages in the log for the slon origin.
>>>
>>> Are new SYNC events being generated in the origin sl_event table with
>>> ev_origin=$originid?
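>>>
>>> For example, on the origin, something like this (the schema name is taken
>>> from your log output; the node id is whatever you assigned to the origin):
>>>
>>>     SELECT ev_origin, ev_seqno, ev_timestamp, ev_type
>>>       FROM _fleetcluster.sl_event
>>>      WHERE ev_origin = 1        -- origin node id, adjust to yours
>>>      ORDER BY ev_seqno DESC
>>>      LIMIT 10;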
>>>
>>> Many versions of slony require an exclusive lock on sl_event to generate
>>> sync events.  Do you have something preventing this?  (i.e. look in
>>> pg_locks to see if the slony sync connection is waiting on a lock).
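>>>
>>> A rough way to check for that would be something like this, run on the
>>> origin (on PostgreSQL 9.0/9.1 the activity view still uses procpid and
>>> current_query; treat it as a sketch and adjust for your version):
>>>
>>>     SELECT l.locktype, l.relation::regclass, l.mode, l.granted,
>>>            a.current_query
>>>       FROM pg_locks l
>>>       JOIN pg_stat_activity a ON a.procpid = l.pid
>>>      WHERE NOT l.granted;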
>>>
>>>
>>>
>>>
>>>
>>>> Hi,
>>>>
>>>> I have been trying to set up a small Slony cluster (only 2 nodes) for
>>>> the last 2 days but I can't get it to work. Every time I get the same
>>>> result: the replication starts off fine, and Slony starts copying to get
>>>> all the tables on the subscribing node up to speed. But somewhere along
>>>> the way the 2nd node stops getting updates. Slony replicates all the data
>>>> in a specific table up to a specific point in time and then no more, and
>>>> that point in time seems to coincide with when the copying of data for
>>>> that specific table started.
>>>>
>>>> An example to illustrate the scenario:
>>>>
>>>> Let's say I have set up the whole replication system and then at 12:00
>>>> I start the actual replication. Around 12:05 the copying of table A from
>>>> node 1 to node 2 starts. It finishes, but only the data that was received
>>>> before 12:05 gets copied to node 2. Then at 12:10 the copying of table B
>>>> starts. Same thing there: Slony copies to node 2 only the data that was
>>>> received before 12:10. And it is the same for all tables.
>>>>
>>>> The logs for the slon daemons show:
>>>>
>>>> Origin node:
>>>> NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=21942
>>>> CONTEXT:  SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>>> NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=21945
>>>> CONTEXT:  SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>>> NOTICE:  Slony-I: Logswitch to sl_log_2 initiated
>>>> CONTEXT:  SQL statement "SELECT "_fleetcluster".logswitch_start()"
>>>> PL/pgSQL function "cleanupevent" line 101 at PERFORM
>>>> 2012-02-24 12:17:39 CETINFO   cleanupThread:    0.019 seconds for
>>>> cleanupEvent()
>>>> NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=21949
>>>> CONTEXT:  SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>>> NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=23779
>>>> CONTEXT:  SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>>>
>>>> Subscribing node:
>>>> 2012-02-24 13:20:23 CETINFO   remoteWorkerThread_1: SYNC 5000000856
>>>> done in 0.012 seconds
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 1 with
>>>> 9 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 2 with
>>>> 15 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 3 with
>>>> 4 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 4 with
>>>> 6 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 5 with
>>>> 3 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 6 with
>>>> 4 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 7 with
>>>> 3 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 8 with
>>>> 23 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: syncing set 9 with
>>>> 8 table(s) from provider 1
>>>> 2012-02-24 13:20:41 CETINFO   remoteWorkerThread_1: SYNC 5000000857
>>>> done in 0.014 seconds
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 1 with
>>>> 9 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 2 with
>>>> 15 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 3 with
>>>> 4 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 4 with
>>>> 6 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 5 with
>>>> 3 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 6 with
>>>> 4 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 7 with
>>>> 3 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 8 with
>>>> 23 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: syncing set 9 with
>>>> 8 table(s) from provider 1
>>>> 2012-02-24 13:20:43 CETINFO   remoteWorkerThread_1: SYNC 5000000858
>>>> done in 0.011 seconds
>>>>
>>>>
>>>>
>>>> Has anyone experienced this before, or does anyone have any idea what
>>>> could be causing this?
>>>>
>>>
>>
>>
>>
>



-- 
Ulas Albayrak
ulas.albayrak at gmail.com

