Wed Dec 7 08:09:18 PST 2005
- Previous message: [Slony1-general] Reducing pg_listener bloat - Pt 1
- Next message: [Slony1-general] Database encoding issues and slony
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi all, This is an issue I've been working through the last few days. PostgreSQL 8.0, as some of you may know, had a pretty poor unicode implementation. In fact, it allowed know unicode sequences to enter the database. So, this affects any user using slony with a utf-8 encoded 8.0 database. If you plan to use slony1 to update an 8.0 database to 8.1 may be for a surprised. Say I have a database running 8.0, utf-8 encoded. It will accept the following sequence as being utf-8 compliant: \060\242. This is *not* utf-8 compliant. 8.1 detects this. When slony goes to copy the data into the 8.1 database, the following occurs: --- 2005-12-07 17:56:13 EST DEBUG4 remoteWorkerThread_1: Begin COPY of table "public"."accounts" 2005-12-07 17:56:13 EST ERROR remoteWorkerThread_1: copy from stdin on local node - PGRES_FATAL_ERROR ERROR: Invalid UNICODE byte sequence detected near byte \uffff CONTEXT: COPY accounts, line 1, column dat: "0\uffff" 2005-12-07 17:56:13 EST WARN remoteWorkerThread_1: data copy for set 1 failed - sleep 60 seconds --- The \uffff is generated by my terminal, which recognises utf-8. \uffff means this is not a unicode character. Since such data cannot be replicated, slony cannot be used as an upgrade tool which the invalid sequences are present in the database. The solution, I believe, is to dump the tables being replicated out to a text file, and do the following: $ iconv -f utf-8 -t utf-8 -o dump.utf8 dump iconv will tell you which characters/sequences are invalid. You must then find the correct sequences or characters and update the affected rows in the database. The problem is, repairing the invalid characters/sequences is very hard if you have a lot of data or data from different character sets. Consider the following: 0000000 245 1 0 0 i s p r e t t y 221 0000020 c h e a p 222 Now, \245 is the Yen symbol in latin1 but \221 and \222 are curly quote marks from windows-1251. If you suffer this problem, I think the only viable option is to think about the kinds of character sets user applications may access when moving data into your database. For english language countries, this falls into three categories, in my experience: ascii (no problems), iso-8859* (latin*) and windows-125*. The windows characters I've seen are usually left and right inverted commas, ellipses and the like. Luckily, these are not mapped by 7bit ascii or iso-8859-1. A script which identifies non-utf-8 characters and provides some context, line numbers, etc, will greatly speed up the process of remedying the situation. Once the correct sequence/character is found, update the affected row in the database. Once all affected rows are updated, replication can commence unhindered. There is another problem, however. Slony is not handling databases encoded in different character sets. Say i wanted to upgrade my 8.0 system, which is encoded latin1, to an 8.1 system encoded in utf-8. This is not possible. You will see the same error as above. The thing is, PostgreSQL can do character conversion between different character sets and provides convenient mechanisms, like 'set client_encoding ...', to deal with this. Slony can utilise this by setting the client encoding to the encoding of the server it is working on. The problem should go away then. I will develop a patch to this effect over the next day or so, unless anyone sees a problem. Unfortunately, this will not help with the first problem I detailed. Gavin
- Previous message: [Slony1-general] Reducing pg_listener bloat - Pt 1
- Next message: [Slony1-general] Database encoding issues and slony
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list