Tue Sep 21 21:35:41 PDT 2004
- Previous message: [Slony1-general] Replicating complex (?) databases
- Next message: [Slony1-general] Error while running slave slon process
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
After watching what sorts of common "bump in the night" scenarios pop up, I'm wanting to set up a smarter sort of "watchdog" script to watch each slon instance to see if it needs to be restarted.

At present, there's a script that basically "zaps" the slons every now and again and restarts them. That is not nearly ideal from a couple of perspectives that I can see:

1. It leaves PG backends around waiting for notifications, which causes dead tuples on pg_listener to linger around, and whatever other ills are engendered by "zombie" transactions.

2. Sometimes it causes the slon instances to get a bit deranged such that they need a "restart node".

My thought is to have the "watchdog" be smarter in three ways:

a) It should only kill the slon if there seems reason to do so.

   The case where we _definitely_ need it is when a VPN network connection falls down, so that events no longer get through. That suggests looking to see how recently events have made it through.

   Here's the query I'm thinking of.
oxrslive=# select now() - ev_timestamp > '00:20:00'::interval as event_old, now() - ev_timestamp as age,
oxrslive-#        ev_timestamp, ev_seqno, ev_origin as origin
oxrslive-# from _oxrslive.sl_event events, _oxrslive.sl_subscribe slony_master
oxrslive-# where
oxrslive-#   events.ev_origin = slony_master.sub_provider and
oxrslive-#   not exists (select * from _oxrslive.sl_subscribe providers
oxrslive(#               where providers.sub_receiver = slony_master.sub_provider and
oxrslive(#                     providers.sub_set = slony_master.sub_set and
oxrslive(#                     slony_master.sub_active = 't' and
oxrslive(#                     providers.sub_active = 't')
oxrslive-# order by ev_origin desc, ev_seqno desc limit 1;
 event_old |       age       |        ev_timestamp        | ev_seqno | origin
-----------+-----------------+----------------------------+----------+--------
 f         | 00:00:01.025902 | 2004-09-21 19:16:43.804917 |   621069 |      1
(1 row)

   It looks for the latest timestamp associated with an event coming from a "master" node, and returns "t" in the first field if the interval since the last event exceeds 20 minutes (which I'm treating as a provisional parameter value).

   Is there anything particularly deranged about that? Or should I be looking to see which 'active' origin has checked in least recently?

b) It should submit a "restart node" if it notices, in the logs:

   FATAL localListenThread: Another slon daemon is serving this node already

   Question: How exuberant should it be about this? Tell all the nodes to restart? Or just the offending one?

c) If the slon process has died, it should restart it, and probably throw out a "Help! Call a dba!" if this has happened too many times recently.

--
let name="cbbrowne" and tld="ca.afilias.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
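The three watchdog rules (a) through (c) above could be combined into a single decision step, roughly as in the sketch below. It is only an illustration under stated assumptions, not an existing Slony-I tool: the function name `decide`, the action strings it returns, and the `MAX_RESTARTS` / `RESTART_WINDOW` values are all hypothetical, and the order in which the rules are checked is one possible choice. The 20-minute threshold and the log text are taken from the message. The caller is assumed to have already run the event-age query, read only the log lines written since the previous check, and tested whether the slon process is alive.

```python
from datetime import datetime, timedelta

# Provisional threshold from the message: no event for 20 minutes.
EVENT_AGE_LIMIT = timedelta(minutes=20)
# Log text quoted in the message.
DUPLICATE_SLON_MSG = "Another slon daemon is serving this node already"
# Hypothetical values: how many restarts within what window count as "too many".
MAX_RESTARTS = 3
RESTART_WINDOW = timedelta(hours=1)


def decide(event_age, new_log_lines, slon_running, restart_times, now):
    """Pick one watchdog action for a single slon instance.

    event_age      -- timedelta since the newest event from the origin
    new_log_lines  -- log lines written since the previous check
    slon_running   -- True if the slon process is alive
    restart_times  -- datetimes of earlier restarts of this slon
    now            -- current datetime
    """
    # (b) The log says another slon already serves this node.
    if any(DUPLICATE_SLON_MSG in line for line in new_log_lines):
        return "restart_node"

    # (c) The process has died: restart it, or call for help if this
    # has happened too often within the window.
    if not slon_running:
        recent = [t for t in restart_times if now - t <= RESTART_WINDOW]
        if len(recent) >= MAX_RESTARTS:
            return "alert_dba"
        return "start_slon"

    # (a) Kill a running slon only when events have stopped arriving.
    if event_age > EVENT_AGE_LIMIT:
        return "kill_and_restart_slon"

    return "ok"
```

A caller would map each returned string onto the real action (running slonik, starting slon, paging someone). Keeping the decision separate from those side effects makes the thresholds easy to adjust and test.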