Jeff Frost jeff at pgexperts.com
Mon Aug 24 07:27:39 PDT 2009
Karl Denninger wrote:
>
>
> Jeff Frost wrote:
>> Karl Denninger wrote:
>>> But they should have switched the master to Node #4 when the move
>>> set command was executed.  When they reconnect they should be doing
>>> so to Node #4, not Node #2 - IF they saw the "move set" command (and
>>> it appears they did.)
>>>
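(For reference, the move itself would look roughly like the slonik
sketch below - only a sketch, assuming set 1 and using placeholder
conninfo strings rather than the real ones; the cluster name comes from
the _tickerform schema mentioned later in this thread.)

    # Sketch: move set 1 from the old origin (node 2) to the new origin (node 4).
    cluster name = tickerform;
    node 2 admin conninfo = 'dbname=ticker host=old-master port=5432 user=slony';
    node 4 admin conninfo = 'dbname=ticker host=new-master port=5432 user=slony';

    lock set (id = 1, origin = 2);
    move set (id = 1, old origin = 2, new origin = 4);
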
>>> Further, I ran the change in the paths on that node - that is,
>>> locally to that machine.  No difference.
>> When you indicate that you ran the store path on that node, can you
>> be specific about what you did?
>>
>>
>>>>> I'm wondering what happened here.  It is almost as if the "move set"
>>>>> never executed on the other subscribers - an impossibility, no?  They
>>>>> WERE all replicating and current just before the shutdown - I checked
>>>>> them all.  How does that happen under these circumstances?
>>>>>
>>>>> Is there a better way for the future?  I'm back up now, but the entire
>>>>> point of this exercise was to AVOID having to copy the entire database
>>>>> over - while I avoided any material downtime for my users, I was left
>>>>> EXPOSED to a failure for the copy period, which was kinda nasty.
>>>>>
>>>>> Thoughts appreciated.
>>>>>
>>>>
>>>> Probably the way to avoid it would have been to issue the store path
>>>> changes before switching the ports.  But, if you forget to do it in the
>>>> future, you can fix it afterwards by going bare metal and updating the
>>>> paths in the _tickerform.sl_path table on the nodes that don't have the
>>>> correct information.
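(As a concrete illustration of that bare-metal route - just a sketch,
run via psql against each node holding stale path data; the conninfo
string is a placeholder and node 4 is assumed to be the new origin:)

    -- sl_path columns: pa_server, pa_client, pa_conninfo, pa_connretry.
    -- Point every stale entry for the new origin (node 4) at its new address.
    UPDATE _tickerform.sl_path
       SET pa_conninfo = 'dbname=ticker host=new-master port=5432 user=slony'
     WHERE pa_server = 4;

The local slons would then most likely need a restart to pick up a
hand-edited path.
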
>>>>
>>> I still don't understand why the node change wasn't picked up by
>>> these slaves when the move set executed; I would have expected them
>>> to pick it up (that is, to expect Node #4 to be the master), and
>>> although it showed up on the "wrong" IP address, a store path should
>>> have fixed that.
>>>
>>> It APPEARS that it was looking for the old master on Node #2....
>>> implying (I think) that it never saw the move set.
>>>
>>> Or am I misunderstanding how the internals work here?
>>>
>>
>> I don't think the problem is that it didn't see the move set, I think
>> the problem is that it didn't get the store path commands because it
>> didn't connect to the 'new' master after you changed the ports out
>> from under it.  I don't think slony is well designed for having the
>> paths changed out from under it and you'll likely have to fix them by
>> hand when you do this.
>>
>> I'm pretty sure what happened (and hopefully someone will correct me
>> if I'm wrong) is that even though you ran the slonik store path command
>> on the broken node, slonik connected to the new master, updated the
>> master's DB with the store path info, and put this event in the log to
>> propagate out to the slaves.  Unfortunately, because the broken slave
>> still had the old path in its sl_path table, it didn't know how to
>> connect to the new master and therefore never received the new path
>> information.
>>
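(In slonik terms, re-pointing a subscriber at the new master would look
something like the sketch below - placeholder conninfo strings, and
node 3 here only stands in for whatever the broken subscriber's actual
node id is:)

    # Sketch: tell the cluster where node 4 (the new origin) now lives.
    cluster name = tickerform;
    node 3 admin conninfo = 'dbname=ticker host=subscriber port=5432 user=slony';
    node 4 admin conninfo = 'dbname=ticker host=new-master port=5432 user=slony';

    # Node 3 stands in for the broken subscriber; repeat per subscriber.
    store path (server = 4, client = 3,
                conninfo = 'dbname=ticker host=new-master port=5432 user=slony');
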
> But the log says it DID receive the new path information - when I
> executed the "store paths" on the client, the slon log file on that
> client immediately reflected that the path configuration had been
> changed.  So clearly, it saw it on the local host.
>
Do you still have the logs sitting around?  Can you post them?
> I have since dropped the old database (which was running as a "safety"
> overnight) using "drop node", and of note, that DID drop the schema as
> the replication was torn down......
>
Ah! You're right!  From the docs: "If the replication daemon is still
running on that node (and processing events), it will attempt to
uninstall the replication system and terminate itself."  Interestingly,
I've never seen it do that. I checked the docs all the way back to
1.2.10 and they say the same thing. I suppose I've almost always dropped
a broken node, so it probably wouldn't have processed the event.
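(For reference, the drop would have been something like the sketch
below - assuming the old master was node 2 and using node 4 as the
event node; conninfo strings are placeholders:)

    # Sketch: retire node 2 from the cluster, announcing the event via node 4.
    cluster name = tickerform;
    node 2 admin conninfo = 'dbname=ticker host=old-master port=5432 user=slony';
    node 4 admin conninfo = 'dbname=ticker host=new-master port=5432 user=slony';

    drop node (id = 2, event node = 4);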

> It's pretty clear to me that something went wrong during the move set -
> but exactly what went wrong and why, I can't reproduce at the present
> time.
>
> I'll have to see if I can set up a "sandbox" and try this in an
> isolated environment to see if I can figure out why it happened and
> hopefully prevent myself from getting bit like this again.
>
>
Always an excellent idea.  Report back with your results and how to
reproduce any strange behavior you find.


-- 
Jeff Frost <jeff at pgexperts.com>
COO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
http://www.pgexperts.com/
