[Slony1-general] Slony Replication in wide-area applications?

Mon Apr 21 09:24:06 PDT 2008

Vivek Khera <vivek at khera.org> writes:
> On Apr 21, 2008, at 9:05 AM, Bill Moran wrote:
>
>> I don't understand the question.  What do you mean by "network
>> partition" and how does this represent a failure scenario?
>
> Normally all hosts can see every other one. When your network is
> partitioned, you end up with at least two subsets which can still see
> every other host within that subset, but none of the hosts in the
> other subset(s).

Slony-I can work fine with network configurations of that sort, where
you have clusters[1] of nodes in LANs and only limited communication
between them across a WAN.

Consider:
- Nodes 1-3 are in a LAN at Data Centre A
- Nodes 4-6 are in a LAN at Data Centre B

There are constraints...
- 1-3 can easily "talk amongst themselves."
- Likewise, 4-6 can easily "talk amongst themselves."
- However, we pick #3 and #4 as being the only nodes that are allowed
  to talk with one another across the WAN

There are configurations you cannot create, in such a case:
- You cannot have any configuration where nodes 5 or 6 subscribe directly to
  1-3; they *MUST* go thru node 4
- Likewise, you cannot have any configuration where nodes 1 or 2
  subscribe directly to 4-6; they *MUST* go thru node 3
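
The restriction above can be sketched as a small reachability check.
This is only an illustration in Python, not anything Slony-I provides:
the names LAN_A, LAN_B, WAN_LINK, can_talk and feed_path are
hypothetical, and the node numbers are the ones from the example.

```python
from collections import deque

# Hypothetical model of the example topology (not a Slony-I API).
LAN_A = {1, 2, 3}      # Data Centre A
LAN_B = {4, 5, 6}      # Data Centre B
WAN_LINK = (3, 4)      # the only pair allowed to talk across the WAN


def can_talk(a, b):
    """Return True if nodes a and b may connect to each other directly."""
    if a == b:
        return False
    same_lan = (a in LAN_A and b in LAN_A) or (a in LAN_B and b in LAN_B)
    return same_lan or {a, b} == set(WAN_LINK)


def feed_path(origin, receiver):
    """Return the shortest chain of providers from origin to receiver.

    Uses a breadth-first search over the allowed connections and
    returns None if the receiver cannot be reached at all.
    """
    nodes = sorted(LAN_A | LAN_B)
    queue = deque([[origin]])
    seen = {origin}
    while queue:
        path = queue.popleft()
        if path[-1] == receiver:
            return path
        for nxt in nodes:
            if nxt not in seen and can_talk(path[-1], nxt):
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(can_talk(1, 5))    # False: node 5 cannot subscribe directly to node 1
    print(feed_path(1, 5))   # [1, 3, 4, 5]: the data must go thru 3 and 4
```

With node 1 as origin, any path to nodes 5 or 6 passes through node 4,
which is the constraint described in the list.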

I don't think network partitioning represents a particularly
compelling problem.

The essential WAN problems are three-fold:

- If the WAN is flakey, a frequently observed problem is that
  connections will have failed, but the database connection doesn't
  actually get dropped by the DB server until a TCP/IP timeout takes
  place, which often takes 2-3 hours.

  During that time, attempts by a slon to reconnect will be rebuffed
  because the old connection is still there, even though there is no
  way for it to be used.  This is somewhat of a moral equivalent to a
  zombie process; the old DB connection is unusable, but doesn't know
  it's dead.

  There's probably some way to automate cleaning the old connection
  out, though it's not something Slony-I could do itself, and I
  haven't tried constructing such a cleanup process.

- If the WAN is sufficiently flakey, it may be problematic to keep a
  transaction running across the WAN for long enough to get a
  subscription going.

- If the WAN is sufficiently flakey, then you may not have enough
  network bandwidth to keep a replica fed.

(Those represent three problems that are different from one another in
their essences...)
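
As a rough worked example for the first problem: if the "TCP/IP
timeout" is the kernel's TCP keepalive mechanism (an assumption; other
timeouts may be involved), the time before a dead peer is noticed is
roughly the idle time plus the probe interval times the probe count.
The Linux defaults are 7200 seconds idle, 75 seconds between probes,
and 9 probes. The function name and the "tuned" values below are
hypothetical, chosen only to show the arithmetic.

```python
def dead_peer_detection_seconds(idle, interval, count):
    """Approximate seconds before TCP keepalive gives up on a silent peer.

    idle:     seconds of inactivity before the first keepalive probe
    interval: seconds between unanswered probes
    count:    number of unanswered probes before the connection is dropped
    """
    return idle + interval * count


# Linux kernel defaults (tcp_keepalive_time, _intvl, _probes).
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "count": 9}

# Hypothetical tighter values, for illustration only.
TUNED = {"idle": 60, "interval": 10, "count": 6}

if __name__ == "__main__":
    default_s = dead_peer_detection_seconds(**LINUX_DEFAULTS)
    tuned_s = dead_peer_detection_seconds(**TUNED)
    print(default_s)   # 7875 seconds, a little over two hours
    print(tuned_s)     # 120 seconds
```

PostgreSQL exposes the same three knobs on the server side as
tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count,
so lowering them is one possible way to get the dead connection
noticed sooner; whether that suits a given WAN is something to test.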

Footnotes: 
[1]  In this case, "cluster" isn't in the Slony-I sense, but rather
simply "a bunch of nodes."
-- 
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://linuxfinances.info/info/sgml.html
"I think you ought to know I'm feeling very depressed"
-- Marvin the Paranoid Android