[Slony1-general] Database encoding issues and slony

Wed Dec 7 08:09:18 PST 2005

Hi all,

This is an issue I've been working through the last few days. PostgreSQL
8.0, as some of you may know, had a pretty poor unicode implementation. In
fact, it allowed know unicode sequences to enter the database. So, this
affects any user using slony with a utf-8 encoded 8.0 database.

If you plan to use slony1 to update an 8.0 database to 8.1 may be for a
surprised. Say I have a database running 8.0, utf-8 encoded. It will
accept the following sequence as being utf-8 compliant: \060\242. This is
*not* utf-8 compliant. 8.1 detects this.

When slony goes to copy the data into the 8.1 database, the following
occurs:

---

2005-12-07 17:56:13 EST DEBUG4 remoteWorkerThread_1: Begin COPY of table
"public"."accounts"
2005-12-07 17:56:13 EST ERROR  remoteWorkerThread_1: copy from stdin on
local node - PGRES_FATAL_ERROR ERROR:  Invalid UNICODE byte sequence
detected near byte \uffff
CONTEXT:  COPY accounts, line 1, column dat: "0\uffff"
2005-12-07 17:56:13 EST WARN   remoteWorkerThread_1: data copy for set 1
failed
- sleep 60 seconds

---

The \uffff is generated by my terminal, which recognises utf-8. \uffff
means this is not a unicode character.

Since such data cannot be replicated, slony cannot be used as an upgrade
tool which the invalid sequences are present in the database. The
solution, I believe, is to dump the tables being replicated out to a text
file, and do the following:

$ iconv -f utf-8 -t utf-8 -o dump.utf8 dump

iconv will tell you which characters/sequences are invalid. You must then
find the correct sequences or characters and update the affected rows in
the database. The problem is, repairing the invalid characters/sequences
is very hard if you have a lot of data or data from different character
sets. Consider the following:

0000000 245   1   0   0       i   s       p   r   e   t   t   y     221
0000020   c   h   e   a   p 222

Now, \245 is the Yen symbol in latin1 but \221 and \222 are curly quote
marks from windows-1251.

If you suffer this problem, I think the only viable option is to think
about the kinds of character sets user applications may access when moving
data into your database. For english language countries, this falls into
three categories, in my experience: ascii (no problems), iso-8859*
(latin*) and windows-125*. The windows characters I've seen are usually
left and right inverted commas, ellipses and the like. Luckily, these are
not mapped by 7bit ascii or iso-8859-1.

A script which identifies non-utf-8 characters and provides some context,
line numbers, etc, will greatly speed up the process of remedying the
situation. Once the correct sequence/character is found, update the
affected row in the database. Once all affected rows are updated,
replication can commence unhindered.

There is another problem, however. Slony is not handling databases encoded
in different character sets. Say i wanted to upgrade my 8.0 system, which
is encoded latin1, to an 8.1 system encoded in utf-8. This is not
possible. You will see the same error as above. The thing is, PostgreSQL
can do character conversion between different character sets and provides
convenient mechanisms, like 'set client_encoding ...', to deal with this.

Slony can utilise this by setting the client encoding to the encoding of
the server it is working on. The problem should go away then. I will
develop a patch to this effect over the next day or so, unless anyone sees
a problem. Unfortunately, this will not help with the first problem I
detailed.

Gavin