[Slony1-general] how to do maintenance on a slave server?

Thu Jul 31 11:30:49 PDT 2008

Alan Hodgson <ahodgson at simkin.ca> writes:
> On Wednesday 30 July 2008, "Stefan Murphy" <stefan at vocalocity.com> wrote:
>> After doing this work we found errors in the slave's log.  Inserts
>> failing because of duplicate values in primary key (non-Slony tables). 
>> Slony would hang on these errors. 
>
> This statement is confusing. If they're non-Slony tables then how could 
> replication have any issues with them?

Confusing indeed.

Slony-I sets up a connection lock so that only one slon can manage
a particular node at a time, and then processes everything within
transactions against both provider and subscriber, so it should cope
gracefully with any of these failures:

   If you shut down the [subscriber DB] while the slon is applying
   changes, you'll have an uncommitted transaction that will be rolled
   back.  No damage done.

Indeed, that should characterize things pretty well for several
failure modes beyond the [subscriber DB] case.  For instance, the
statement should remain valid if we replace [subscriber DB] with:

- [slon process]
- [provider DB]

Nor should it be invalidated by any of the following kinds of
failure:
- Killing the slon process;
- Stopping the subscriber backend process by killing it cleanly;
- Stopping the subscriber backend process by killing it uncleanly, thereby
  requiring crash recovery when the postmaster restarts;
- Stopping the provider backend process by killing it cleanly;
- Stopping the provider backend process by killing it uncleanly, thereby
  requiring crash recovery when the postmaster restarts.
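
To make the "uncommitted transaction gets rolled back" point concrete,
here is a minimal sketch.  It is an analogy only: it uses Python's
bundled sqlite3 module as a stand-in for PostgreSQL, and the table name
"replicated" and the functions apply_batch(), count_rows() and demo()
are made up for illustration.  None of this is Slony-I code.  It shows
a batch of inserts dying before COMMIT, leaving no rows behind, and the
same batch then being re-applied without any duplicate-key errors.

```python
import os
import sqlite3
import tempfile


def apply_batch(db_path, rows, crash_before_commit=False):
    """Apply rows inside one transaction (hypothetical helper).

    If crash_before_commit is True, the connection is closed without
    COMMIT, imitating a slon or backend being killed mid-apply.
    """
    conn = sqlite3.connect(db_path)
    try:
        for row_id, payload in rows:
            # sqlite3 opens a transaction implicitly before the first INSERT.
            conn.execute(
                "INSERT INTO replicated (id, payload) VALUES (?, ?)",
                (row_id, payload),
            )
        if crash_before_commit:
            return  # no COMMIT: the pending work is discarded on close
        conn.commit()
    finally:
        conn.close()


def count_rows(db_path):
    """Count committed rows using a fresh connection."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT count(*) FROM replicated").fetchone()[0]
    finally:
        conn.close()


def demo():
    """Return (rows after the simulated crash, rows after the retry)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        conn.execute(
            "CREATE TABLE replicated (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        conn.commit()
        conn.close()

        batch = [(1, "a"), (2, "b"), (3, "c")]
        apply_batch(path, batch, crash_before_commit=True)
        after_crash = count_rows(path)

        # Re-applying the same batch succeeds because nothing was kept.
        apply_batch(path, batch)
        after_retry = count_rows(path)
        return after_crash, after_retry
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

If duplicate-key errors do show up on a retry, that suggests some of
the rows were committed outside the transaction that was supposedly
rolled back, which is not what the failure modes above should produce.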

If the provider or subscriber hosts are shut down, there is a
possibility of corruption of the filesystem which might invalidate one
or another of the databases; Slony-I can't really help with that.

We've seen such corruptions emerge from the following sorts of
phenomena:

  - IBM HACMP failover captured requests on the SCSI bus and tried to
    re-apply them, trashing the filesystem + database;

  - A gradual power outage might leave some fading signals on the SCSI
    or fibrechannel bus as the computer "died," leading to the disk
    array getting phantom writes, trashing the filesystem + database.

Do any of these failure modes seem familiar?  (i.e., indicative of
what happened here?)
-- 
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://cbbrowne.com/info/lsf.html
Rules  of the  Evil Overlord  #145. "My  dungeon cell  decor  will not
feature exposed pipes.  While they add to the  gloomy atmosphere, they
are good  conductors of vibrations and  a lot of  prisoners know Morse
code." <http://www.eviloverlord.com/>