Wed Feb 20 16:13:37 PST 2008
Christopher, I appreciate your efforts as well as those of everyone else on the list. I'm glad to see you folks haven't given up on me yet. :)

Christopher Browne wrote:
> Geoffrey <lists at serioustechnology.com> writes:
>> Andrew Sullivan wrote:
>>> I am by no means willing to dismiss the suggestion that there are bugs in
>>> Slony; but this still looks to me very much like there's something we don't
>>> know about what happened, that explains the errors you're seeing.
>> I would so love to figure out this issue. I appreciate your efforts.
>>
>> I simply don't understand how one table in particular could get so far
>> out of sync. We're talking 300 records.
>>
>> I can't imagine that slony is that fragile. There's got to be
>> something going on that we don't see.
>
> I agree. From what I have heard, it doesn't sound like you have
> experienced anything that should be scratching any of the edge points
> of Slony-I.
>
> 300 records don't just disappear.
>
> When I put this all together, I'm increasingly suspicious that you may
> have experienced hardware problems or some such thing that might cause
> data loss that Slony-I would have no way to address.

Understand, I'm not saying that I'm losing data, just that there are inconsistencies between the replication server and the primary. I don't believe we are losing data on the primary at all. What I see is that record counts in some tables don't match, so the replication process is not working as expected.

The weird thing is, not every table is affected, just a handful. We're talking 88 tables and 84 sequences, but only 4 tables have problems. Here's a comparison of record counts (a sketch of one way to generate such a comparison follows below):

< count for adest 54055
---
> count for adest 54056
65c65
< count for mcarr 22560
---
> count for mcarr 22572
67c67
< count for mcust 63757
---
> count for mcust 63774
94c94
< count for tract 75380
---
> count for tract 75420

This hardware has been rock solid since it was installed. If we were losing data on the primary, we would definitely hear about it.

One thing I didn't mention is the actual configuration: two boxes connected to a single data silo, in a hot/hot configuration, with a separate postmaster for each database. Half the postmasters run on one server, the other half on the other. If/when one fails, the other picks up the postmaster processes. Each database has its own IP, so I reference the host by multiple host names: connect to database mwr via host mwr. In the event of a failure, the mwr IP is moved to the other machine (a sketch of slon pointed at these floating names also follows below).

<snip>

> You've grown suspicious about *every* component, which, on the one
> hand, is unsurprising, but on the other, not much useful. I haven't
> heard you mention anything that would cause me to expect Slony-I to
> have eaten data, or to have even "started to look hungrily at the
> data."

The only reason I keep looking at Slony is that the rest of the system is rock solid. We don't lose data and these boxes are up 24/7; folks are hitting them constantly. Slony is the only new part of the equation.

> The notices you have mentioned are all benign things. The one
> question that comes to mind: Any interesting ERROR messages in the
> PostgreSQL logs? I'm getting more and more suspicious that something
> about the entire DB cluster has gotten unstable, and if that's the
> case, Slony-I wouldn't do any better than the DB it is running on...

There are no PostgreSQL errors to speak of on the primary.
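For reference, a comparison like the one above can be generated by dumping per-table row counts from each server and diffing the two lists. This is only a minimal sketch, not the poster's actual script: the host names are placeholders, and the table list is trimmed to the four problem tables (the database name mwr is from the post):

  #!/bin/sh
  # Dump per-table row counts from the primary and the replica, then
  # diff the two lists.  Host names below are placeholders.
  TABLES="adest mcarr mcust tract"

  counts() {
      host=$1
      for t in $TABLES; do
          printf "count for %s " "$t"
          psql -h "$host" -d mwr -At -c "SELECT count(*) FROM $t;"
      done
  }

  counts primary-host > /tmp/counts.primary
  counts replica-host > /tmp/counts.replica
  diff /tmp/counts.primary /tmp/counts.replica

Counts taken while the origin is busy are a moving target, so a one-off diff like this is only a rough signal of how far out of sync a table is.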
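On the hot/hot layout, the per-database host names also let the slon daemons be pointed at the floating name rather than at a physical box. A sketch under assumptions only: the cluster name mwr_cluster is inferred from the _mwr_cluster schema in the logs, and the replica hostname, user, port, and log paths are made up for illustration:

  # One slon per node.  The origin slon uses the floating per-database
  # hostname (mwr), so it follows the IP if that postmaster fails over
  # to the other box.  Replica host, user, port, and log paths are
  # placeholders, not the poster's setup.
  slon mwr_cluster "dbname=mwr host=mwr user=slony port=5432" \
      >> /var/log/slon-mwr-origin.log 2>&1 &
  slon mwr_cluster "dbname=mwr host=mwr-replica user=slony port=5432" \
      >> /var/log/slon-mwr-replica.log 2>&1 &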
I do see the following in the PostgreSQL log on the slave:

2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid" is not yet defined
DETAIL: Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid is only a shell
2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid_snapshot" is not yet defined
DETAIL: Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid_snapshot is only a shell

Since these are NOTICEs, I assume this is normal.

During the initial replication, I do see a number of:

2008-02-19 19:32:28 [2463] LOG: checkpoints are occurring too frequently (6 seconds apart)

But our problem doesn't seem to start until after the initial replication.

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
 - Benjamin Franklin
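The "checkpoints are occurring too frequently" messages during the initial copy usually just mean the WAL settings are tight for a bulk load; on 8.x-era PostgreSQL they are normally quieted by raising checkpoint_segments on the slave. A minimal sketch, where the host name, the value, and the data-directory variable are placeholders rather than the poster's settings:

  # Check the current setting on the slave, then raise it in that
  # postmaster's postgresql.conf and reload.  The value is illustrative.
  psql -h replica-host -d mwr -c "SHOW checkpoint_segments;"
  #   postgresql.conf:  checkpoint_segments = 30
  pg_ctl -D "$MWR_SLAVE_PGDATA" reload

Since the discrepancies only show up after the initial copy completes, this is bulk-load tuning rather than an explanation for the missing rows.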