[Slony1-general] Slony replication stops right after start
Fri Feb 24 08:44:19 PST 2012
On 12-02-24 10:41 AM, Ulas Albayrak wrote:
> Hi,
>
> Sorry, it seems I forgot to post versions:
>
> Slony: 2.0.7
> PostgreSQL: 9.0.5
>
> I restarted the slon daemon with debug=4 on both nodes, and this is what I got:
>
> Origin:
> 2012-02-24 16:20:50 CETDEBUG2 localListenThread: Received event 1,5000002765 SYNC
> 2012-02-24 16:20:50 CETDEBUG2 syncThread: new sl_action_seq 45211 - SYNC 5000002766

This shows that SYNC events are now being generated on the origin.

> Subscriber:
> 2012-02-24 16:28:53 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:28:53 CETDEBUG2 remoteWorkerThread_1: current local log_status is 2
> 2012-02-24 16:28:53 CETDEBUG3 remoteWorkerThread_1: activate helper 1
> 2012-02-24 16:28:53 CETDEBUG4 remoteWorkerThread_1: waiting for log data
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: got work to do
> 2012-02-24 16:28:53 CETDEBUG2 remoteWorkerThread_1_1: current remote log_status = 1
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: allocate line buffers
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: fetch from cursor
> 2012-02-24 16:28:53 CETDEBUG1 remoteHelperThread_1_1: 0.002 seconds delay for first row
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: fetched 0 log rows
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: return 50 unused line buffers
> 2012-02-24 16:28:53 CETDEBUG1 remoteHelperThread_1_1: 0.003 seconds until close cursor
> 2012-02-24 16:28:53 CETDEBUG1 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0

This shows that SYNC events are being processed on the slave and that there is nothing to do.

I also notice that you have 9 replication sets. Are changes failing to replicate for all tables in all the sets, or only for some of the sets?

sl_log_1 and sl_log_2 on the origin should have a record of rows that need to be replicated. If you insert a row into one of your tables, you should then see that data in sl_log_1 or sl_log_2. Is this the case?
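A minimal sketch of that check, assuming the "_fleetcluster" schema from your logs; "public.some_table" and "some_column" are just placeholders for one of the replicated tables:

  -- Insert a hypothetical test row (table and column names are placeholders).
  INSERT INTO public.some_table (some_column) VALUES ('slony log test');

  -- On the origin, the row should show up in whichever log table is currently
  -- active; check both, since logging switches between sl_log_1 and sl_log_2.
  SELECT count(*) FROM _fleetcluster.sl_log_1;
  SELECT count(*) FROM _fleetcluster.sl_log_2;

If both counts stay at 0 after the insert is committed, that would suggest the log trigger on the origin is not capturing the change at all.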
> 2012-02-24 16:28:53 CETDEBUG1 remoteWorkerThread_1: sync_helper timing: pqexec (s/count)- provider 0.002/3 - subscriber 0.000/3
> 2012-02-24 16:28:53 CETDEBUG1 remoteWorkerThread_1: sync_helper timing: large tuples 0.000/0
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: change helper thread status
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: send DONE/ERROR line to worker
> 2012-02-24 16:28:53 CETDEBUG3 remoteHelperThread_1_1: waiting for workgroup to finish
> 2012-02-24 16:28:53 CETDEBUG3 remoteWorkerThread_1: helper 1 finished
> 2012-02-24 16:28:53 CETDEBUG4 remoteWorkerThread_1: returning lines to pool
> 2012-02-24 16:28:53 CETDEBUG3 remoteWorkerThread_1: all helpers done.
> 2012-02-24 16:28:53 CETDEBUG4 remoteWorkerThread_1: changing helper 1 to IDLE
> 2012-02-24 16:28:53 CETDEBUG2 remoteWorkerThread_1: cleanup
> 2012-02-24 16:28:53 CETDEBUG4 remoteHelperThread_1_1: waiting for work
> 2012-02-24 16:28:53 CETINFO remoteWorkerThread_1: SYNC 5000002873 done in 0.014 seconds
> 2012-02-24 16:28:53 CETDEBUG1 remoteWorkerThread_1: SYNC 5000002873 sync_event timing: pqexec (s/count)- provider 0.001/1 - subscriber 0.008/1 - IUD 0.001/4
> 2012-02-24 16:28:55 CETDEBUG2 syncThread: new sl_action_seq 1 - SYNC 5000001345
> 2012-02-24 16:28:59 CETDEBUG2 localListenThread: Received event 2,5000001345 SYNC
> 2012-02-24 16:29:05 CETDEBUG2 syncThread: new sl_action_seq 1 - SYNC 5000001346
> 2012-02-24 16:29:11 CETDEBUG2 localListenThread: Received event 2,5000001346 SYNC
> 2012-02-24 16:29:11 CETDEBUG2 remoteListenThread_1: queue event 1,5000002874 SYNC
> 2012-02-24 16:29:11 CETDEBUG2 remoteListenThread_1: queue event 1,5000002875 SYNC
> 2012-02-24 16:29:11 CETDEBUG2 remoteListenThread_1: queue event 1,5000002876 SYNC
> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: Received event #1 from 5000002874 type:SYNC
> 2012-02-24 16:29:11 CETDEBUG1 calc sync size - last time: 1 last length: 18011 ideal: 3 proposed size: 3
> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: SYNC 5000002876 processing
> 2012-02-24 16:29:11 CETDEBUG1 about to monitor_subscriber_query - pulling big actionid list for 1
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 1 with 9 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 2 with 15 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 3 with 4 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 4 with 6 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 5 with 3 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 6 with 4 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 7 with 3 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 8 with 23 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: syncing set 9 with 8 table(s) from provider 1
> 2012-02-24 16:29:11 CETDEBUG4 ssy_action_list value:
> 2012-02-24 16:29:11 CETDEBUG2 ssy_action_list length: 0
> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: current local log_status is 2
> 2012-02-24 16:29:11 CETDEBUG3 remoteWorkerThread_1: activate helper 1
> 2012-02-24 16:29:11 CETDEBUG4 remoteWorkerThread_1: waiting for log data
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: got work to do
> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1_1: current remote log_status = 1
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: allocate line buffers
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: fetch from cursor
> 2012-02-24 16:29:11 CETDEBUG1 remoteHelperThread_1_1: 0.002 seconds delay for first row
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: fetched 0 log rows
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: return 50 unused line buffers
> 2012-02-24 16:29:11 CETDEBUG1 remoteHelperThread_1_1: 0.002 seconds until close cursor
> 2012-02-24 16:29:11 CETDEBUG1 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0
> 2012-02-24 16:29:11 CETDEBUG1 remoteWorkerThread_1: sync_helper timing: pqexec (s/count)- provider 0.002/3 - subscriber 0.000/3
> 2012-02-24 16:29:11 CETDEBUG1 remoteWorkerThread_1: sync_helper timing: large tuples 0.000/0
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: change helper thread status
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: send DONE/ERROR line to worker
> 2012-02-24 16:29:11 CETDEBUG3 remoteHelperThread_1_1: waiting for workgroup to finish
> 2012-02-24 16:29:11 CETDEBUG3 remoteWorkerThread_1: helper 1 finished
> 2012-02-24 16:29:11 CETDEBUG4 remoteWorkerThread_1: returning lines to pool
> 2012-02-24 16:29:11 CETDEBUG3 remoteWorkerThread_1: all helpers done.
> 2012-02-24 16:29:11 CETDEBUG4 remoteWorkerThread_1: changing helper 1 to IDLE
> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: cleanup
> 2012-02-24 16:29:11 CETDEBUG4 remoteHelperThread_1_1: waiting for work
> 2012-02-24 16:29:11 CETINFO remoteWorkerThread_1: SYNC 5000002876 done in 0.014 seconds
> 2012-02-24 16:29:11 CETDEBUG1 remoteWorkerThread_1: SYNC 5000002876 sync_event timing: pqexec (s/count)- provider 0.001/1 - subscriber 0.008/1 - IUD 0.001/4
> 2012-02-24 16:29:11 CETDEBUG2 remoteWorkerThread_1: forward confirm 2,5000001346 received by 1
>
> New SYNC events are being generated. I can find several rows in the origin's sl_event table with ev_origin=$originid. None of them, however (including rows with ev_origin=$subscriberid), are older than about 15 minutes, even though the most recent data in the subscriber db is several hours old.
>
> I'm not really sure how to interpret the data in the pg_locks view, but neither the origin nor the subscriber has any rows where the "granted" column isn't set to TRUE, if that means anything.
>
> /Ulas
>
>
> On Fri, Feb 24, 2012 at 3:34 PM, Steve Singer <ssinger at ca.afilias.info> wrote:
>> On 12-02-24 08:21 AM, Ulas Albayrak wrote:
>>
>> You didn't say what version of slony you are using with which version of postgresql.
>>
>> I don't see anything in the logs you posted about the slon for the origin node generating sync events. At DEBUG2 or higher (at least on some versions of slony) you should be getting "syncThread: new sl_action_seq %s" type messages in the log for the slon origin.
>>
>> Are new SYNC events being generated in the origin sl_event table with ev_origin=$originid?
>>
>> Many versions of slony require an exclusive lock on sl_event to generate sync events. Do you have something preventing this? (i.e. look in pg_locks to see if the slony sync connection is waiting on a lock.)
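For that last check, here is a sketch against the 9.0 catalogs mentioned above (pg_stat_activity still uses procpid and current_query in that release); any row returned means some backend is waiting on a lock:

  -- Show ungranted locks together with the query of the waiting backend.
  SELECT l.locktype, l.relation::regclass AS relation, l.mode, a.current_query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.procpid = l.pid
   WHERE NOT l.granted;

An empty result is consistent with what you describe: nothing is blocked, so the stall is unlikely to be a lock wait on sl_event.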
>>
>>> Hi,
>>>
>>> I have been trying to set up a small Slony cluster (only 2 nodes) for the last 2 days, but I can't get it to work. Every time I get the same result: the replication starts off fine. Slony starts copying, trying to get all the tables in the subscribing node up to speed. But somewhere along the way the 2nd node stops getting updates. Slony replicates all the data in a specific table up to a specific point in time and then no more. And this time seems to coincide with when the copying of data for that specific table started.
>>>
>>> An example to illustrate the scenario:
>>>
>>> Let's say I have set up the whole replication system and then at 12:00 I start the actual replication. Around 12:05 copying of table A from node 1 to node 2 starts. It finishes, but only the data that was received before 12:05 gets copied to node 2. Then at 12:10 copying of table B starts. Same thing here: Slony copies all the data that was received before 12:10 to node 2. And this is the same for all tables.
>>>
>>> The logs for the slon daemons show:
>>>
>>> Origin node:
>>> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=21942
>>> CONTEXT: SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=21945
>>> CONTEXT: SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>> NOTICE: Slony-I: Logswitch to sl_log_2 initiated
>>> CONTEXT: SQL statement "SELECT "_fleetcluster".logswitch_start()"
>>> PL/pgSQL function "cleanupevent" line 101 at PERFORM
>>> 2012-02-24 12:17:39 CETINFO cleanupThread: 0.019 seconds for cleanupEvent()
>>> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=21949
>>> CONTEXT: SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=23779
>>> CONTEXT: SQL statement "SELECT "_fleetcluster".cleanupNodelock()"
>>> PL/pgSQL function "cleanupevent" line 83 at PERFORM
>>>
>>> Subscribing node:
>>> 2012-02-24 13:20:23 CETINFO remoteWorkerThread_1: SYNC 5000000856 done in 0.012 seconds
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 1 with 9 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 2 with 15 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 3 with 4 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 4 with 6 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 5 with 3 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 6 with 4 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 7 with 3 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 8 with 23 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: syncing set 9 with 8 table(s) from provider 1
>>> 2012-02-24 13:20:41 CETINFO remoteWorkerThread_1: SYNC 5000000857 done in 0.014 seconds
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 1 with 9 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 2 with 15 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 3 with 4 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 4 with 6 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 5 with 3 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 6 with 4 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 7 with 3 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 8 with 23 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: syncing set 9 with 8 table(s) from provider 1
>>> 2012-02-24 13:20:43 CETINFO remoteWorkerThread_1: SYNC 5000000858 done in 0.011 seconds
>>>
>>> Has anyone experienced this before, or does anyone have an idea what could be causing this?
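As a general way to keep an eye on this while you test, the sl_status view on the origin reports how far each subscriber is behind; a minimal sketch, again assuming the "_fleetcluster" schema from your logs:

  -- Per subscriber, as seen from the origin: how many SYNC events are still
  -- unconfirmed and how old the last confirmed event is.
  SELECT st_origin, st_received, st_lag_num_events, st_lag_time
    FROM _fleetcluster.sl_status;

In a healthy two-node setup both lag columns should stay small; if they grow steadily, SYNCs are being generated but not confirmed back by the subscriber.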