[Slony1-general] slony archive log monitoring questions.

Tue Jun 1 16:24:05 PDT 2010

Hi all,

    I have a two-node Slony cluster, and the slon daemon on the slave
node runs with the -a option. I'm trying to better understand how the
logs work for Slony log shipping, and I have noticed some behavior that
seems odd (perhaps just odd to me) that someone may be able to explain.
Beyond log shipping, I'd also like to understand how the Slony slave
and master communicate with each other to make sure that the slave is
in sync with the master and receives ALL the data it needs without
missing any.

Each archive log created by the Slony slave with the -a option looks
something like this:

---------------------------------------

------------------------------------------------------------------
-- Slony-I log shipping archive
-- Node 1, Event 555
------------------------------------------------------------------
start transaction;
select "_slony".archiveTracking_offline('22', '2010-06-01 12:53:11.214886');
-- end of log archiving header
------------------------------------------------------------------
-- start of Slony-I data
------------------------------------------------------------------
select "_slony".sequenceSetValue_offline(1,'38');

------------------------------------------------------------------
-- End Of Archive Log
------------------------------------------------------------------
commit;
vacuum analyze "_slony".sl_archive_tracking;

---------------------------------------

This log obviously contains no actual replication data, but I notice
two things in it. The first is the Event number; the second is the
number passed to archiveTracking_offline in the select statement (the
same number that appears in the name of this particular file,
slony1_log_2_00000000000000000022.sql).
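
To make those two numbers concrete, here is a minimal Python sketch that pulls them out of an archive log and compares the tracking number with the counter in the file name. The regular expressions are modeled only on the sample above, so other Slony-I versions may need different patterns, and parse_archive is a hypothetical helper, not part of Slony:

```python
import re

# Patterns modeled on the sample archive in this post; other Slony-I
# versions may format the header differently (assumption).
FILENAME_RE = re.compile(r"slony1_log_(\d+)_(\d+)\.sql$")
EVENT_RE = re.compile(r"^-- Node (\d+), Event (\d+)", re.MULTILINE)
TRACKING_RE = re.compile(r"archiveTracking_offline\('(\d+)'")


def parse_archive(filename, text):
    """Extract the counters from one archive log (hypothetical helper)."""
    name = FILENAME_RE.search(filename)
    event = EVENT_RE.search(text)
    tracking = TRACKING_RE.search(text)
    if not (name and event and tracking):
        raise ValueError("unrecognized archive log: %s" % filename)
    return {
        "file_counter": int(name.group(2)),  # number in the file name
        "node": int(event.group(1)),         # "Node N" in the header
        "event": int(event.group(2)),        # "Event N" in the header
        "tracking": int(tracking.group(1)),  # archiveTracking_offline argument
    }


# Header lines copied from the sample archive above.
SAMPLE = """\
-- Slony-I log shipping archive
-- Node 1, Event 555
start transaction;
select "_slony".archiveTracking_offline('22', '2010-06-01 12:53:11.214886');
"""

info = parse_archive("slony1_log_2_00000000000000000022.sql", SAMPLE)
print(info)
print("counter matches file name:", info["file_counter"] == info["tracking"])
```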

For Slony log shipping to work, I understand that every log file is
required, in order. Looking through the actual files, though, I've
noticed that the Events themselves sometimes skip a number, or several.
For example:

---------------------------------------
# cat /path/to/slon_archive_logs/* | grep Event
-- Node 1, Event 554
-- Node 1, Event 555
-- Node 1, Event 556
-- Node 1, Event 558
-- Node 1, Event 559
---------------------------------------

Here event 557 is missing, yet the numbering of the archive logs is
unbroken: each log file has the expected number in its name. This
happens often. I originally thought that the log containing the
previous event (in this case 556) would hold the data for both events,
and that 557 would simply not appear. I've looked through every single
archive log, and the event does not appear in any of them, not even in
later files.
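
The manual check above can be sketched in Python. This version scans a directory, collects the counter from each file name and the Event number from each header, and reports gaps in the two sequences separately, since whether a gap in the Event numbers is a problem is exactly what I am unsure about. The patterns are assumptions based on the sample file, and the function names are hypothetical:

```python
import os
import re

# Patterns modeled on the sample archive in this post (assumption).
FILENAME_RE = re.compile(r"slony1_log_\d+_(\d+)\.sql$")
EVENT_RE = re.compile(r"^-- Node \d+, Event (\d+)", re.MULTILINE)


def missing_between(numbers):
    """Return the integers absent between the smallest and largest value."""
    present = set(numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1)
            if n not in present]


def scan_archive_dir(path):
    """Return (missing file counters, missing event numbers) for a directory."""
    counters, events = [], []
    for name in sorted(os.listdir(path)):
        match = FILENAME_RE.search(name)
        if not match:
            continue  # skip anything that is not an archive log
        counters.append(int(match.group(1)))
        with open(os.path.join(path, name)) as handle:
            # The Event line sits in the header, so the first few KB suffice.
            header = EVENT_RE.search(handle.read(4096))
        if header:
            events.append(int(header.group(1)))
    return missing_between(counters), missing_between(events)


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        missing_files, missing_events = scan_archive_dir(sys.argv[1])
        print("missing file counters:", missing_files)
        print("missing event numbers:", missing_events)
```

Run against the directory behind the listing above, this would be expected to report no missing file counters and event 557 as a missing event number.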

I then thought that this event might not be a SYNC at all, but some
other kind of event that wouldn't make it into these archive logs (this
might still be the case). However, shortly after noticing that the
event was missing, I looked at the sl_event table and saw that the
event is indeed there and is indeed a SYNC:

---------------------------------------
postgres=# select ev_seqno, ev_type from _slony.sl_event where ev_origin = '1';
 ev_seqno | ev_type
----------+---------
      554 | SYNC
      555 | SYNC
      556 | SYNC
      557 | SYNC
      558 | SYNC
      559 | SYNC
---------------------------------------

Basically, I want to write a small script that alerts me by email if
something goes wrong with Slony replication. I recently had an incident
where data was missing from the Slony slave; as far as I could
determine, the data was never replicated, yet Slony never reported any
errors (I described this in a recent email to this mailing list). It
would be easy to raise an alert that says "whoa, event 557 was not
found", but if it is normal for events to be missing like this, that
wouldn't be a good approach to take.
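
As a rough idea of the alerting side, here is a minimal Python sketch using only the standard library. It alerts on gaps in the archive file counters rather than on Event numbers, because I don't yet know whether missing Events are normal. The function names, the addresses, and the assumption of a mail server listening on localhost are all placeholders:

```python
import smtplib
from email.message import EmailMessage


def build_alert(missing_counters, sender, recipient):
    """Build an alert email, or return None when nothing is missing."""
    if not missing_counters:
        return None
    msg = EmailMessage()
    msg["Subject"] = ("Slony archive logs missing: %d file(s)"
                      % len(missing_counters))
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "Missing archive log counters: "
        + ", ".join(str(n) for n in sorted(missing_counters))
    )
    return msg


def send_alert(msg, host="localhost"):
    """Send the message through an SMTP server (host is an assumption)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)


# Hypothetical example: archive file number 23 never showed up.
alert = build_alert([23], "slony@example.com", "dba@example.com")
print(alert["Subject"])
```

A cron job could call send_alert(alert) whenever build_alert returns a message, and do nothing when it returns None.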

I've seen that there are a couple of Slony monitoring tools, and I'll
be checking them out to see whether they offer anything I could use.
Any other suggestions, or even some clarification of how this works,
would be greatly appreciated. If my Slony slave ends up missing data
again, I'd like to have some logs to review to see when something may
have gone wrong, even if I have to generate those logs myself with some
sort of monitoring.

Thanks in advance,
    Brian F