Wed Apr 29 11:58:29 PDT 2009
- Previous message: [Slony1-general] [Slony 2.0.1] Events and Confirmations not propagating
- Next message: [Slony1-general] hyphen in slony cluster names
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Any help would be great ... Cyril Scetbon wrote: > I'm still investigating .... > > The slon restart does not resolve the error. After the restart I have > something like : > > select ev_origin, min(ev_seqno), max(ev_seqno), > date_trunc('minutes', min(now() - ev_timestamp)), > date_trunc('minutes', max(now() - ev_timestamp)), > min(now() - ev_timestamp) > '00:30:00' as agehi > from _pns_slony_voila_archi_0.sl_event group by ev_origin; > ev_origin | min | max | date_trunc | date_trunc | agehi > -----------+---------+---------+------------+------------+------- > 103 | 65 | 66 | 00:01:00 | 00:47:00 | f > 104 | 60 | 61 | 00:01:00 | 00:47:00 | f > 102 | 33 | 34 | 00:01:00 | 00:47:00 | f > 101 | 4541581 | 4542350 | 00:00:00 | 00:13:00 | f > (4 rows) > > So, another event is generated (for example, 66 from node 103) > > log extract : > 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 33 > processing > 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_102: no sets need > syncing for this event > 2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event > 102,33 SYNC > 2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 102,33 > ignored - duplicate > 2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event > 104,60 SYNC > 2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 104,60 > ignored - duplicate > 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_104: SYNC 60 > processing > 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_104: no sets need > syncing for this event > 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_103: SYNC 65 > processing > 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_103: no sets need > syncing for this event > > after a few minutes I have : > > select ev_origin, min(ev_seqno), max(ev_seqno), > date_trunc('minutes', min(now() - ev_timestamp)), > date_trunc('minutes', max(now() - ev_timestamp)), > min(now() - ev_timestamp) > '00:30:00' as agehi > from _pns_slony_voila_archi_0.sl_event group by ev_origin; > ev_origin | min | max | date_trunc | date_trunc | agehi > -----------+---------+---------+------------+------------+------- > 103 | 66 | 66 | 00:12:00 | 00:12:00 | f > 104 | 61 | 61 | 00:12:00 | 00:12:00 | f > 102 | 34 | 34 | 00:12:00 | 00:12:00 | f > 101 | 4542310 | 4543008 | 00:00:00 | 00:11:00 | f > > log extract : > 2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_103: SYNC 66 > processing > 2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_103: no sets need > syncing for this event > 2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_104: SYNC 61 > processing > 2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_104: no sets need > syncing for this event > 2009-04-28 16:37:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 34 > processing > 2009-04-28 16:37:53 CEST DEBUG1 remoteWorkerThread_102: no sets need > syncing for this event > > Precedent events (65,60 and 33) have been cleared, and a new one per > node has been generated. But, the error continues and after 30 minutes > I have : > > select ev_origin, min(ev_seqno), max(ev_seqno), > date_trunc('minutes', min(now() - ev_timestamp)), > date_trunc('minutes', max(now() - ev_timestamp)), > min(now() - ev_timestamp) > '00:30:00' as agehi > from _pns_slony_voila_archi_0.sl_event group by ev_origin; > ev_origin | min | max | date_trunc | date_trunc | agehi > -----------+---------+---------+------------+------------+------- > 103 | 66 | 66 | 00:41:00 | 00:41:00 | t > 104 | 61 | 61 | 00:41:00 | 00:41:00 | t > 102 | 34 | 34 | 00:41:00 | 00:41:00 | t > 101 | 4543634 | 4544715 | 00:00:00 | 00:18:00 | f > > In Summary : > > Events are not correctly propagated from nodes 102, 103 and 104. If I > restart every slons processes new events are generated on these nodes > but still not correctly propagated. However the last event on each > node seems to be propagated thanks to the restart stage. > > Any idea ? > > Cyril Scetbon wrote: >> Hi, >> >> I found in slony log files that the worker has restarted the process >> on each host in my configuration (1 provider, 3 receivers) at the >> same time : >> >> 2009-04-28 08:47:37 CEST INFO remoteWorkerThread_101: SYNC 4516360 >> sync_event timing: pqexec (s/count)- provider 0.001/1 - subscriber >> 0.035/1 - IUD 0.000/2 >> 2009-04-28 08:47:38 CEST ERROR remoteListenThread_102: "select >> ev_origin, ev_seqno, ev_timestamp, ev_snapshot, >> "pg_catalog".txid_snapshot_xmin(ev_snapshot), >> "pg_catalog".txid_snapshot_xmax(ev_snapshot), ev_type, >> ev_data1, ev_data2, ev_data3, ev_data4, ev_data5, >> ev_data6, ev_data7, ev_data8 from >> "_pns_slony_voila_archi_0".sl_event e where (e.ev_origin = '102' and >> e.ev_seqno > '29') or (e.ev_origin = '103' and e.ev_seqno > '62') >> order by e.ev_origin, e.ev_seqno limit 40" - FATAL: terminating >> connection due to administrator command >> 2009-04-28 08:47:38 CEST ERROR remoteListenThread_102: "select >> con_origin, con_received, max(con_seqno) as con_seqno, >> max(con_timestamp) as con_timestamp from >> "_pns_slony_voila_archi_0".sl_confirm where con_received <> 104 group >> by con_origin, con_received" 2009-04-28 08:47:38 CEST CONFIG slon: >> child terminated status: 11; pid: 28152, current worker pid: 28152 >> 2009-04-28 08:47:38 CEST CONFIG slon: restart of worker in 10 seconds >> >> The origin of the first error is that postgresql has been restarted >> on one receiver without stopping slon before. Is it a known source of >> errors ? >> The slon process has been terminated with status=0 on the receiver >> where postgresql has been restarted and has segfault (chid status=11) >> on the others (each watchdog has restart a new slon process). This >> fact seems to cause the error with events and confirmations. >> >> Regards. >> >> Cyril Scetbon wrote: >>> A simple restart of all slon processes seems to have resolved the >>> issue. weird .... >>> >>> Cyril Scetbon wrote: >>>> I use test_slony_state and see some informations like : >>>> >>>> Check of event info >>>> --------------------------------------------------- >>>> Problem : Events not propagating to node 2 >>>> Problem : Events not propagating to node 4 >>>> Problem : Events not propagating to node 3 >>>> >>>> Check of sl_confirm aging >>>> --------------------------------------------------- >>>> Confirmations not propagating from 2 to 1 >>>> Confirmations not propagating from 2 to 3 >>>> Confirmations not propagating from 2 to 4 >>>> Confirmations not propagating from 3 to 1 >>>> Confirmations not propagating from 3 to 2 >>>> Confirmations not propagating from 3 to 4 >>>> Confirmations not propagating from 4 to 1 >>>> Confirmations not propagating from 4 to 2 >>>> Confirmations not propagating from 4 to 3 >>>> >>>> You can see the results on one of my databases : >>>> >>>> - for Confirmations >>>> >>>> select con_origin, con_received, min(con_seqno) as minseq, >>>> max(con_seqno) as maxseq, date_trunc('minutes', >>>> min(now()-con_timestamp)) as age1, date_trunc('minutes', >>>> max(now()-con_timestamp)) as age2, min(now() - con_timestamp) > >>>> '00:30:00' as tooold from _pns_slony_voila_archi_0.sl_confirm group >>>> by con_origin, con_received order by con_origin, con_received; >>>> con_origin | con_received | minseq | maxseq | age1 | age2 >>>> | tooold >>>> ------------+--------------+---------+---------+----------+----------+-------- >>>> >>>> 101 | 102 | 4464029 | 4464792 | 00:00:00 | 00:16:00 >>>> | f >>>> 101 | 103 | 4464027 | 4464792 | 00:00:00 | 00:16:00 >>>> | f >>>> 101 | 104 | 4464024 | 4464792 | 00:00:00 | 00:16:00 >>>> | f >>>> 102 | 101 | 29 | 29 | 03:39:00 | 03:39:00 >>>> | t >>>> 102 | 103 | 29 | 29 | 03:39:00 | 03:39:00 >>>> | t >>>> 102 | 104 | 29 | 29 | 03:39:00 | 03:39:00 >>>> | t >>>> 103 | 101 | 62 | 62 | 03:39:00 | 03:39:00 >>>> | t >>>> 103 | 102 | 62 | 62 | 03:39:00 | 03:39:00 >>>> | t >>>> 103 | 104 | 62 | 62 | 03:39:00 | 03:39:00 >>>> | t >>>> 104 | 101 | 57 | 57 | 03:39:00 | 03:39:00 >>>> | t >>>> 104 | 102 | 57 | 57 | 03:39:00 | 03:39:00 >>>> | t >>>> 104 | 103 | 57 | 57 | 03:39:00 | 03:39:00 >>>> | t >>>> >>>> >>>> - for Events >>>> >>>> select ev_origin, min(ev_seqno), max(ev_seqno), >>>> date_trunc('minutes', min(now() - ev_timestamp)), >>>> date_trunc('minutes', max(now() - ev_timestamp)), >>>> min(now() - ev_timestamp) > '00:30:00' as agehi >>>> from _pns_slony_voila_archi_0.sl_event group by ev_origin; >>>> ev_origin | min | max | date_trunc | date_trunc | agehi >>>> -----------+---------+---------+------------+------------+------- >>>> 103 | 62 | 62 | 03:49:00 | 03:49:00 | t >>>> 104 | 57 | 57 | 03:49:00 | 03:49:00 | t >>>> 102 | 29 | 29 | 03:49:00 | 03:49:00 | t >>>> 101 | 4464493 | 4465346 | 00:00:00 | 00:14:00 | f >>>> >>>> >>>> What can be the source of these errors and how to track them ? FYI, >>>> I have logs but in debug level 1. >>>> >>>> I did not have issues in 1.2.15. >>>> >>>> Regards. >>> >> > -- Cyril SCETBON
- Previous message: [Slony1-general] [Slony 2.0.1] Events and Confirmations not propagating
- Next message: [Slony1-general] hyphen in slony cluster names
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list