[Slony1-general] Re: slon won't start after EXECUTE QUERY

Sat Nov 13 21:43:06 PST 2004

I guess my little problem was a bit vague, but I'm somewhat surprised no one
thought of what now seems like the obvious reason that slon wouldn't start. I
was right that node 2 died on the DDL_SCRIPT event. Then, every time I tried to
restart slon, it re-received the event and died again. I'm still quite
confused as to why the script fails on only this node (especially since the 
exact same script succeeds when run from psql).

Anyway, I need to get this problem fixed, which means one of three things:
deleting the DDL_SCRIPT event from node 1, adding a fake confirmation from
node 2, or changing the script itself (the one saved in sl_event) to something 
that will definitely succeed.

Can anyone tell me which would be the best solution, and more importantly, how 
to do it safely?

David Pitkin

-------Original Message-------------------------
Hello,

I'm brand new to Slony-I. Someone else set it up with a master node (node 1)
and two slaves (node 2 and node 3). I needed to change the schema, and have
successfully managed to break node 2 in the process (happily this is still in
the development stage). Here's what happened. Hopefully someone can tell me
what I did wrong:

1. First, I should mention that node 1 and node 2 are on the same machine
(Linux), with node 3 on a separate machine. I needed to change the data type
of a column, using SQL like this:
ALTER TABLE table ADD COLUMN field_new;
UPDATE table SET field_new = field;
ALTER TABLE table DROP COLUMN field;
ALTER TABLE table RENAME COLUMN field_new TO field;

2. I ran this script using the EXECUTE QUERY command in slonik. It failed
initially, because I forgot that the schema containing the table I needed to
modify was not in the search path for the 'slony' user. It failed on node 1,
and appeared to be isolated there (i.e. the event did not get sent to the
other two nodes). I've checked the Schemadoc, and this seems to be the
expected behavior. I also double-checked the process list at that point and
verified that two slon processes were still running (for nodes 1 and 2).

3. I fixed the script and ran it a second time. It succeeded on node 1, and on
node 3. But node 2 was unchanged, and further investigation showed that the
corresponding slon process was dead. I tried restarting it, and it complained
a few times about there being no remote worker thread for node 1, and died
with an empty error message.

4. I manually fixed the schema on node 2, and started slon again. Slon died in
the same way.
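
For reference, the search path problem from step 2 can be avoided by
schema-qualifying the table in the script. This is only a sketch of that idea,
not the script I actually ran: the schema name "app", the table name "mytable"
and the new type "integer" are placeholders.

```sql
-- Placeholder names: schema "app", table "mytable", new type "integer".
-- Schema-qualifying the table means the script does not depend on the
-- search_path of the user that slon connects as.
ALTER TABLE app.mytable ADD COLUMN field_new integer;
UPDATE app.mytable SET field_new = CAST(field AS integer);
ALTER TABLE app.mytable DROP COLUMN field;
ALTER TABLE app.mytable RENAME COLUMN field_new TO field;
```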

I checked the Slony-I tables, and it appears that node 2 confirmed the SYNC
event sent by node 1 just before the DDL_SCRIPT event (the timestamps of both
events match). This suggests that the script killed node 2, and a quick glance
at the remote worker thread source code suggests that if a script were to
fail, the thread would immediately die. But I can't figure out why the slon
process refuses to restart.
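
The check described above can be done with queries along these lines. This is
a rough sketch: the cluster name "mycluster" (and so the schema "_mycluster")
is a placeholder, and the column names are the ones I believe sl_event and
sl_confirm use, so verify them against the Schemadoc for your version.

```sql
-- Placeholder: cluster named "mycluster", so its schema is "_mycluster".
-- Most recent events that originated on node 1.
SELECT ev_origin, ev_seqno, ev_timestamp, ev_type
  FROM _mycluster.sl_event
 WHERE ev_origin = 1
 ORDER BY ev_seqno DESC
 LIMIT 10;

-- Highest node 1 event number that node 2 has confirmed.
SELECT con_origin, con_received, max(con_seqno) AS last_confirmed
  FROM _mycluster.sl_confirm
 WHERE con_origin = 1
   AND con_received = 2
 GROUP BY con_origin, con_received;
```

If last_confirmed is the event number immediately before the DDL_SCRIPT row in
the first result, then node 2 stopped at that event.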

Does anyone have any thoughts?

David Pitkin