Steve Singer ssinger at ca.afilias.info
Mon May 7 12:22:58 PDT 2012
On 12-05-04 05:46 PM, Richard Yen wrote:
> On Wed, May 2, 2012 at 2:39 PM, Steve Singer <ssinger at ca.afilias.info>
> wrote:
>
>     Are any of the above possible:
>
>     1. You had multiple slon daemons writing to the same log archive
>     directory (maybe for different clusters?)
>
> We have several clusters writing to a directory, but there's a separate
> directory for each cluster.  For example:
>
> /home/log_ship
> /home/log_ship/cluster1/new_logfiles
> /home/log_ship/cluster2/new_logfiles
> /home/log_ship/cluster3/new_logfiles
> ...etc...
> We don't have two daemons writing to the same directory.
>
>
>     2.  The mechanism you used for copying the .sql files could have
>     caused processes to try to write to the same file on the destination
>     machine
>
> I'm fairly certain this is not the case.  The files that I sent you were
> directly from the origin machine, not from the destination machine.  Our
> scheme is like this:
>
> Node1 is origin
> Node2 is subscriber, with -a mode, writing files to
> /home/log_ship/cluster1/new_logfiles
> Cronjob moves files from /home/log_ship/cluster1/new_logfiles to
> /home/log_ship/cluster1/log_staging (we filter out the *.sql.tmp files
> so that we can let them finish writing before we move them)
> RemoteNode makes rsync connection to Node2 and copies the files from
> Node2/home/log_ship/cluster1/log_staging to its local directory
> Log files are replayed
>
>     If the answer to both of those is no then maybe there is a bug in
>     how archive file numbers are assigned in remote_worker.c:archive_open.
>     We don't YET see any obvious faults with this logic but if this
>     logic somehow assigned 2 slon worker threads the same id then you
>     could get a file like you sent us.
>
>
> As I look at the files you sent me, I only see differences between the
> third (Node X, Event XXXXXX) and seventh
> (archiveTracking_offline(xxx,'xxxx-xx-xx xx:xx:xx')) lines.  I noticed
> that Node number can vary per file, but only one daemon has the -a
> option enabled.  Not sure why the node number changes--shouldn't it
> always correspond to the node number of the daemon with the -a option
> turned on?
>

The node whose slon runs with the -a option is the node number that
shows up in the file name.  I.e., slony1_log_3_XXXXXXXXX.sql is a log file
generated by slon #3.
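
To check which slon produced a given file, the node number can be read straight out of the name. This is only a rough sketch: the pattern slony1_log_<node>_<counter>.sql is inferred from the example name above, the assumption that the counter part is all digits is mine, and parse_archive_name is a hypothetical helper, not something that ships with Slony.

```python
import re

# Assumed layout: slony1_log_<node>_<counter>.sql (inferred from the example
# file name in this thread; the all-digits counter is an assumption).
_ARCHIVE_NAME = re.compile(r"^slony1_log_(\d+)_(\d+)\.sql$")


def parse_archive_name(filename):
    """Return (node_id, counter) for a log shipping file name, or None."""
    match = _ARCHIVE_NAME.match(filename)
    if match is None:
        # Anything else, including in-progress *.sql.tmp files, is ignored.
        return None
    return int(match.group(1)), int(match.group(2))
```

Grouping the files in a directory by the returned node_id would show quickly whether more than one slon has been writing there.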

slon 3 has a remote_worker for each of the remote nodes.  These 
remote_worker threads run concurrently.  Each one will generate a 
tracking file for SYNC events from its remote_worker.

The archive sequence numbers are supposed to be assigned on the node the
slon is for (node 3 in this case).  So two remote_worker threads inside
of slon 3 SHOULDN'T ever get the same archive counter number (we should
be serializing on an update of the archive_counter table), thus they
shouldn't be writing to the same file.  One theory is that some of the
"shouldn'ts" are actually happening (for reasons we haven't determined).
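
To illustrate the property we are relying on, here is a toy model, not the code in remote_worker.c. It assumes a single shared counter guarded by a lock, standing in for the serialized update of the archive counter; ArchiveCounter and run_workers are made-up names for this sketch.

```python
import threading


class ArchiveCounter:
    """Toy stand-in for a counter whose updates are serialized."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def next(self):
        # Only one thread at a time may bump and read the counter, so no
        # two callers can ever be handed the same number.
        with self._lock:
            self._value += 1
            return self._value


def run_workers(num_workers, files_per_worker):
    """Run concurrent workers and return every counter value handed out."""
    counter = ArchiveCounter()
    assigned = []
    assigned_lock = threading.Lock()

    def worker():
        for _ in range(files_per_worker):
            number = counter.next()
            with assigned_lock:
                assigned.append(number)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return assigned
```

As long as every worker goes through the lock, the returned list never contains a repeat. Two workers landing in the same file would mean that, somewhere, this kind of serialization is being bypassed or is not holding.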

> Aside from that, I tried poking around the sl_* tables that I had
> dumped, but didn't really find anything.  One thing is certain,
> though--a given DML statement shows up in sl_log_x only once, even
> though it shows up several times in the various logship files that are
> generated.  I can't seem to find the corresponding sl_event row, so I'm
> not sure if there might be anything in that direction, in terms of
> duplicated events.
>
> --Richard


