Thu May 20 07:48:26 PDT 2010
Jan Wieck wrote:
> On 5/20/2010 9:35 AM, Cyril Scetbon wrote:
>
>> Jan Wieck wrote:
>>
>>> On 5/12/2010 10:31 AM, Gurjeet Singh wrote:
>>>
>>>
>>>> Hi All,
>>>>
>>>> I have two Slony test beds which show the exact same symptoms!
>>>>
>>>> select * from sl_event order by ev_seqno;
>>>>
>>>>  ev_origin |  ev_seqno  |        ev_timestamp        |     ev_snapshot     | ev_type |
>>>> -----------+------------+----------------------------+---------------------+---------+-
>>>>          2 | 5000000002 | 2010-04-30 08:32:38.622928 | 458:458:            | SYNC    |
>>>>          1 | 5000525721 | 2010-05-12 13:30:22.79626  | 72685915:72685915:  | SYNC    |
>>>>          1 | 5000525722 | 2010-05-12 13:30:24.800943 | 72686139:72686139:  | SYNC    |
>>>>          1 | 5000525723 | 2010-05-12 13:30:26.804862 | 72686224:72686224:  | SYNC    |
>>>> ...
>>>>
>>>>
>>>>
>>> Slony always keeps at least the last event per origin around. Otherwise
>>> the view sl_status would not work.
>>>
>>>
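(That would explain it: if I understand sl_status correctly, it joins the
newest sl_event row per origin against sl_confirm to compute the lag, so at
least one event per origin has to survive cleanup. A rough sketch of the kind
of query I mean, not the actual view definition:

    select e.ev_origin,
           c.con_received,
           max(e.ev_seqno)   - max(c.con_seqno)     as lag_num_events,
           current_timestamp - max(c.con_timestamp) as lag_time
      from _OURCLUSTER.sl_event e
      join _OURCLUSTER.sl_confirm c on c.con_origin = e.ev_origin
     group by e.ev_origin, c.con_received;

If the last event of an origin were purged, that join would return no rows
for it.)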
>> Hi Jan, can you say more about that? I posted a mail to slony1-bugs
>> today because test_slony_state.pl is warning us about old events
>> (exactly those oldest ones). This concerns events generated by the
>> local node: I only see events from the local node when I restart it:
>>
>
> I presume that you have set sync_interval_timeout to zero on the
> subscribers, which will prevent the generation of SYNC events on those
> nodes because no actual replication work is ever generated there. It
> looks like test_slony_state.pl depends on that parameter being non-zero
> (the default is -t 10000, meaning a SYNC every 10 seconds).
>
No, as you can see:

2010-05-20 15:31:55 CEST CONFIG slon: watchdog ready - pid = 23457
2010-05-20 15:31:55 CEST CONFIG main: Integer option vac_frequency = 3
2010-05-20 15:31:55 CEST CONFIG main: Integer option log_level = 2
2010-05-20 15:31:55 CEST CONFIG main: Integer option sync_interval = 500
2010-05-20 15:31:55 CEST CONFIG main: Integer option sync_interval_timeout = 10000
2010-05-20 15:31:55 CEST CONFIG main: Integer option sync_group_maxsize = 20
2010-05-20 15:31:55 CEST CONFIG main: Integer option desired_sync_time = 60000
2010-05-20 15:31:55 CEST CONFIG main: Integer option syslog = 0
2010-05-20 15:31:55 CEST CONFIG main: Integer option quit_sync_provider = 0
2010-05-20 15:31:55 CEST CONFIG main: Integer option quit_sync_finalsync = 0
2010-05-20 15:31:55 CEST CONFIG main: Integer option sync_max_rowsize = 8192
2010-05-20 15:31:55 CEST CONFIG main: Integer option sync_max_largemem = 5242880
2010-05-20 15:31:55 CEST CONFIG main: Integer option remote_listen_timeout = 300
2010-05-20 15:31:55 CEST CONFIG main: Boolean option log_pid = 0
2010-05-20 15:31:55 CEST CONFIG main: Boolean option log_timestamp = 1
2010-05-20 15:31:55 CEST CONFIG main: Boolean option cleanup_deletelogs = 0
2010-05-20 15:31:55 CEST CONFIG main: Real option real_placeholder = 0.000000
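
(For reference, those two settings could equally be set in a slon.conf; a
minimal sketch, assuming the parameter names match the CONFIG lines above:

    # slon.conf sketch -- parameter names taken from the CONFIG log above
    sync_interval = 500            # check for new replication work every 500 ms
    sync_interval_timeout = 10000  # force a SYNC at least every 10 s; 0 disables it

)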
But this node is a receiver, and I saw in the code of the function
generate_sync_event that it does not generate a SYNC event on a node
that is not the origin of a set. That's why I presume no SYNC is
created except the mandatory one created at startup in syncThread_main:
    /*
     * We don't initialize the last known action sequence to the actual value.
     * This causes that we create a SYNC event always on startup, just in
     * case.
     */
    last_actseq_buf[0] = '\0';

    /*
     * Build the query that starts a transaction and retrieves the last value
     * from the action sequence.
     */
    dstring_init(&query1);
    slon_mkquery(&query1,
                 "start transaction;"
                 "set transaction isolation level serializable;"
                 "select last_value from %s.sl_action_seq;",
                 rtcfg_namespace);

    /*
     * Build the query that calls createEvent() for the SYNC
     */
    dstring_init(&query2);
    slon_mkquery(&query2,
                 "select %s.createEvent('_%s', 'SYNC', NULL)"
                 " from %s.sl_node where no_id = %s.getLocalNodeId('_%s') "
                 " and exists (select 1 from %s.sl_set where set_origin = no_id);",
                 rtcfg_namespace, rtcfg_cluster_name,
                 rtcfg_namespace, rtcfg_namespace,
                 rtcfg_cluster_name, rtcfg_namespace);
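
For our cluster (named OURCLUSTER, hence the _OURCLUSTER schema) that second
query expands to roughly the following; the EXISTS clause seems to be what
prevents periodic SYNCs on a node that is not the origin of a set:

    select _OURCLUSTER.createEvent('_OURCLUSTER', 'SYNC', NULL)
      from _OURCLUSTER.sl_node
     where no_id = _OURCLUSTER.getLocalNodeId('_OURCLUSTER')
       and exists (select 1 from _OURCLUSTER.sl_set
                    where set_origin = no_id);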
>
> Jan
>
>
>> select * from _OURCLUSTER.sl_event where ev_origin = 102;
>>  ev_origin | ev_seqno |        ev_timestamp        |     ev_snapshot      | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
>> -----------+----------+----------------------------+----------------------+---------+----------+----------+----------+----------+----------+----------+----------
>>        102 |       51 | 2010-05-20 12:27:00.099562 | 338318875:338318875: | SYNC    |          |          |          |          |          |          |          |
>> (1 row)
>>
>> select * from _OURCLUSTER.sl_confirm where con_origin=102;
>> con_origin | con_received | con_seqno | con_timestamp
>> ------------+--------------+-----------+----------------------------
>> 102 | 101 | 51 | 2010-05-20 12:27:02.78581
>> 102 | 103 | 51 | 2010-05-20 12:27:00.118815
>> 102 | 104 | 51 | 2010-05-20 12:27:00.253975
>>
>> The SYNC appears in the slon logs as "new sl_action_seq 1 - SYNC %d".
>>
>>
>>> What should worry you is that there are no newer SYNC events from node 2
>>> available. Slony does create a sporadic SYNC every now and then even if
>>> there is no activity or the node isn't an origin anyway.
>>>
>>> Is it possible that node 2's clock is way off?
>>>
>>>
>>> Jan
>>>
>>>
>>>
>>>> The reason I think this _might_ be a bug is that on both clusters the
>>>> slave node's sl_event has the exact same record for ev_seqno=5000000002,
>>>> except for the timestamp: same origin and same snapshot!
>>>>
>>>> The head of sl_confirm has:
>>>>
>>>> select * from sl_confirm order by con_seqno;
>>>>
>>>> con_origin | con_received | con_seqno | con_timestamp
>>>> ------------+--------------+------------+----------------------------
>>>> 2 | 1 | 5000000002 | 2010-04-30 08:32:53.974021
>>>> 1 | 2 | 5000527075 | 2010-05-12 14:15:41.192279
>>>> 1 | 2 | 5000527076 | 2010-05-12 14:15:43.193607
>>>> 1 | 2 | 5000527077 | 2010-05-12 14:15:45.196291
>>>> 1 | 2 | 5000527078 | 2010-05-12 14:15:47.197005
>>>> ...
>>>>
>>>> Can someone comment on the health of the cluster? All events, except for
>>>> that one, are being confirmed and purged from the system regularly, so my
>>>> assumption is that the cluster is healthy and that the slave is in sync
>>>> with the master.
>>>>
>>>> Thanks in advance.
>>>> --
>>>> gurjeet.singh
>>>> @ EnterpriseDB - The Enterprise Postgres Company
>>>> http://www.enterprisedb.com
>>>>
>>>> singh.gurjeet@{ gmail | yahoo }.com
>>>> Twitter/Skype: singh_gurjeet
>>>>
>>>> Mail sent from my BlackLaptop device
>>>>
>>>>
>>>
>>>
>
>
>
--
Cyril SCETBON