[Slony1-general] strategy to fix utf8 encoding errors

Sun May 28 15:16:28 PDT 2006

On 5/27/06, Vivek Khera <vivek at khera.org> wrote:
> I have a database (rt3) in a postgres 8.0 server which has UNICODE
> encoding.  It was replicated to another 8.0 DB just fine for a long
> time.  Today I upgraded the replica to 8.1 and when I went to
> replicate it, I got UTF8 encoding failure from one of the tables:
> 'invalid byte sequence for encoding "UTF8": 0xa9'
>
> Aside from playing whack-a-mole and fixing the errors one at a time
> as they are reported by slon, what can I do to make the data UTF8
> safe for the strict checking of Pg 8.1?
>
> And what does one do to figure out what character to replace or do
> you generally just cut the offending character from the row?
>

When we migrated from 7.4 to 8.1, we had problems with bad characters.
There was a small set of them, usually characters which hadn't been
converted to UTF-8 and were still in the original latin-1 or
windows-1252 encoding.

Luckily, UTF-8 strings are pretty distinctive.  It is pretty easy to
write a regex which only matches valid UTF-8 strings.  You could run
that against a dump, against every column in every table, or only
against particular problem columns.  If you have a good idea of what the
original character set was and what characters you can expect, then
you can translate them to Unicode.
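
A rough sketch of that approach in Python, in case it helps.  This is
not what we actually ran; the function names are made up for
illustration, and it assumes the stray bytes are really windows-1252
(0xa9 would then be the copyright sign).  The regex follows the
well-formed byte ranges from the Unicode standard, so overlong forms
and surrogates are rejected.  Any byte that is not part of a valid
UTF-8 sequence is re-encoded from the assumed source encoding.

```python
import re

# One or more well-formed UTF-8 characters (no overlongs, no surrogates).
VALID_UTF8_RUN = re.compile(rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general case
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general case
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    )+
""", re.VERBOSE)


def bad_byte_offsets(data: bytes) -> list:
    """Return the offsets of bytes that are not part of valid UTF-8."""
    offsets = []
    pos = 0
    while pos < len(data):
        m = VALID_UTF8_RUN.match(data, pos)
        if m:
            pos = m.end()          # skip the whole valid run
        else:
            offsets.append(pos)    # this byte is the offender
            pos += 1
    return offsets


def repair(data: bytes, assumed_encoding: str = "cp1252") -> bytes:
    """Re-encode invalid bytes as UTF-8, assuming they came from
    `assumed_encoding` (an assumption; adjust to your own data)."""
    out = []
    pos = 0
    while pos < len(data):
        m = VALID_UTF8_RUN.match(data, pos)
        if m:
            out.append(m.group())
            pos = m.end()
        else:
            bad = data[pos:pos + 1]
            # Bytes undefined in the assumed encoding become U+FFFD.
            text = bad.decode(assumed_encoding, errors="replace")
            out.append(text.encode("utf-8"))
            pos += 1
    return b"".join(out)


if __name__ == "__main__":
    row = b"Copyright \xa9 2006"
    print(bad_byte_offsets(row))
    print(repair(row).decode("utf-8"))
```

One caveat: some latin-1 byte pairs happen to be valid UTF-8 as well,
so a check like this only catches bytes that cannot be UTF-8 at all.
It is worth eyeballing the repaired rows before loading them.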

 - Ian