incubator-lucy-user mailing list archives

From Grant McLean <gr...@catalyst.net.nz>
Subject Re: [lucy-user] Index state during merges
Date Wed, 02 Nov 2011 19:20:55 GMT
On Wed, 2011-11-02 at 21:03 +0200, goran kent wrote:
> With precisely this in mind, my code does some gymnastics to try and
> make sure bad utf8 doesn't make it in.  But,... you never know when
> dealing with the vagaries of the 'tubes.

It's not uncommon for web sites to lie about the encoding of the content
they serve up.  In particular, ASCII, UTF-8, ISO-8859-1 and CP1252 are
all completely interchangeable - up to the point where they're not.

My https://metacpan.org/module/Encoding::FixLatin module is designed to
help in dealing with that sort of situation and especially the case
where a single document contains bytes from more than one encoding.
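
To illustrate the mixed-encoding case, here is a minimal Python sketch of
the general approach: keep well-formed UTF-8 sequences as they are, and
treat any other high byte as CP1252 (falling back to ISO-8859-1 for the few
bytes CP1252 leaves undefined).  This is not the implementation or API of
Encoding::FixLatin; the names fix_mixed and _utf8_len are hypothetical and
exist only for this example.

```python
def _utf8_len(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte, or 0."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that never starts a sequence


def fix_mixed(data: bytes) -> str:
    """Decode bytes that may mix UTF-8 with CP1252/ISO-8859-1 (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # Plain ASCII means the same thing in all four encodings.
            out.append(chr(byte))
            i += 1
            continue
        length = _utf8_len(byte)
        if length:
            try:
                # Strict decoding rejects truncated or malformed sequences.
                out.append(data[i:i + length].decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                pass  # not valid UTF-8 here, so treat it as a single byte
        try:
            out.append(data[i:i + 1].decode("cp1252"))
        except UnicodeDecodeError:
            # Bytes such as 0x81 are undefined in Python's cp1252 codec.
            out.append(data[i:i + 1].decode("latin-1"))
        i += 1
    return "".join(out)


# UTF-8 "é", a bare Latin-1 "ï" and CP1252 curly quotes in one byte string.
sample = b"caf\xc3\xa9 na\xefve \x93hi\x94"
print(fix_mixed(sample))
```

The obvious trade-off is that a genuine CP1252 byte pair which happens to
form valid UTF-8 will be read as UTF-8, so this sort of repair is a
heuristic rather than a guarantee.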

Cheers
Grant

