[Slony1-general] child terminated status: 11 -> restart of worker in 10 seconds

Fri Jun 4 06:35:35 PDT 2010

Hello!

To summarize: I have a large DB (about 15 GB on disk) where I can start
replication. At first it seems OK, but when the slave has almost caught
up (judging by the size on disk) the slon process starts to segfault.

I compiled slon (same version, 1.2.15) with debugging symbols and ran
"ulimit -c unlimited" and also "ulimit -n 32768", since I was also
getting "Too many open files" errors.
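Since ulimit only affects the shell it is run in and the processes started from it, it can be worth confirming that the limits really apply where slon is launched. A minimal sketch in C, assuming a Linux/POSIX system; print_limit and report_limits are hypothetical helper names for illustration and are not part of Slony:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print the soft and hard value of one resource limit.
 * Returns 0 on success, -1 if getrlimit() fails. */
int print_limit(const char *name, int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0) {
        perror(name);
        return -1;
    }

    printf("%s: soft=", name);
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("unlimited");
    else
        printf("%llu", (unsigned long long) rl.rlim_cur);

    printf(" hard=");
    if (rl.rlim_max == RLIM_INFINITY)
        printf("unlimited\n");
    else
        printf("%llu\n", (unsigned long long) rl.rlim_max);

    return 0;
}

/* Report the two limits that "ulimit -c" and "ulimit -n" change:
 * maximum core file size and maximum number of open file descriptors. */
int report_limits(void)
{
    int rc = 0;

    if (print_limit("RLIMIT_CORE", RLIMIT_CORE) != 0)
        rc = -1;
    if (print_limit("RLIMIT_NOFILE", RLIMIT_NOFILE) != 0)
        rc = -1;
    return rc;
}
```

Calling report_limits() from a small main(), run from the same shell or init script that starts slon, shows the limits a child process would inherit. A soft RLIMIT_CORE of 0 would mean no core file gets written.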

And now I finally got around to producing a core dump (see below).

Is there any hope? This is Ubuntu 9.04, PostgreSQL 8.3.7 and Slony 1.2.15.

There is a 1.2.16 version of Slony available for Postgres 8.3 in Ubuntu.
Any chance that it could solve the problem? The other option I see is
to dump/restore the master DB. Other than that... Any ideas? :)

  wbr / Alex

So it segfaults while logging?
===================================================================
Core was generated by `/usr/bin/slon -f /etc/slony1/bigdb/slon.conf -p
/var/run/slony1/bigdb'.
Program terminated with signal 11, Segmentation fault.
[New process 31964]
[New process 31965]
[New process 31966]
[New process 31949]
[New process 31967]
[New process 31971]
[New process 31954]
[New process 31952]
#0  0x00007fcfd5e37c40 in strlen () from /lib/libc.so.6
(gdb) bt
#0  0x00007fcfd5e37c40 in strlen () from /lib/libc.so.6
#1  0x00007fcfd5e0075e in vfprintf () from /lib/libc.so.6
#2  0x00007fcfd5eb3738 in __vsnprintf_chk () from /lib/libc.so.6
#3  0x0000000000418764 in slon_log (level=<value optimized out>, fmt=0x420093 " ssy_action_list value: %s\n") at /usr/include/bits/stdio2.h:78
#4  0x000000000040b4c5 in sync_event (node=0x1659f90, local_conn=<value optimized out>, wd=0x1659920, event=0x7fcfcc008f20) at remote_worker.c:4334
#5  0x000000000040dcf8 in remoteWorkerThread_main (cdata=0x1659f90) at remote_worker.c:630
#6  0x00007fcfd61303ba in start_thread () from /lib/libpthread.so.0
#7  0x00007fcfd5e9cfcd in clone () from /lib/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb)
===================================================================
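
The trace shows strlen() faulting under vfprintf() while slon_log formats a "%s" argument, which usually means the pointer passed for %s was invalid. The kernel messages from the earlier crashes (quoted further down) give the fault address as 273936, which the kernel prints in hex. Assuming that address is the bad pointer itself (the trace is consistent with this but does not prove it), one quick check is whether the value looks like text, since a pointer that decodes to printable ASCII may have been overwritten by string data. A minimal sketch in C; addr_as_text is a hypothetical helper for illustration and is not part of Slony:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write the bytes of `addr` to `out`, lowest-order byte first, which is the
 * order a little-endian machine such as x86-64 stores them in memory.
 * Stops at the first zero byte; non-printable bytes are written as '.'.
 * `out` must have room for sizeof(uint64_t) + 1 bytes.
 * Returns the number of characters written (excluding the terminator).
 */
size_t addr_as_text(uint64_t addr, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < sizeof(addr); i++) {
        unsigned char b = (unsigned char) ((addr >> (8 * i)) & 0xff);

        if (b == 0)
            break;
        out[n++] = isprint(b) ? (char) b : '.';
    }
    out[n] = '\0';
    return n;
}
```

For 0x273936 this gives the three characters 69' (digit, digit, single quote). That is only a hint, not a diagnosis: it would fit a pointer clobbered by part of a quoted value, but small integers can look like ASCII by coincidence.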

Steve Singer wrote:
> Alexander Kolodziej wrote:
>> Hello!
>> ...
>> 2010-05-25 11:16:00 UTC DEBUG2 remoteListenThread_1: queue event
>> 1,2533 SYNC
>> 2010-05-25 11:16:00 UTC DEBUG2 slon: child terminated status: 11; pid:
>> 29027, current worker pid: 29027
>> 2010-05-25 11:16:00 UTC DEBUG1 slon: restart of worker in 10 seconds
>> --------------------------
>>
>> In syslog I see this on the slave (slon segfault errors every 10s):
>> --------------------------
>> May 25 11:16:30 semc-sh62 kernel: [20053518.436336] slon[29076]:
>> segfault at 273936 ip 00007fd69e8bac40 sp 00007fd69ad48698 error 4 in
>> libc-2.9.so[7fd69e83a000+168000]
>> May 25 11:16:40 semc-sh62 kernel: [20053528.548794] slon[29104]:
>> segfault at 273936 ip 00007f359f4f4c40 sp 00007f359b982698 error 4 in
>> libc-2.9.so[7f359f474000+168000]
>> --------------------------
>>
> 
> Can you run gdb against a core file, or start slony up inside of gdb, so
> we can get a stack trace of what slon was doing when it died?
> 
> (from a build with debugging symbols would be even more useful)
> 
> 
>> What could cause this?
>> Looking at the size of /var/lib/postgresql/8.3/ I can see that it has
>> almost succeeded in replicating the DB, but something is going boink.
>>
>> Slon loglevel is set to 4.
>> Are there any slon sl_* tables I can look in for info?
>> I tried Google but only get 8 results for "child terminated status: 11" +
>> "restart of worker", and none of those provide a solution.
>>
>>   wbr / Alexander