lucene-solr-user mailing list archives

From Shawn Heisey <s...@elyograg.org>
Subject Re: Invalid UTF-8 character 0xfffe during shard update
Date Mon, 05 Aug 2013 18:54:49 GMT
On 8/5/2013 12:12 PM, Federico Chiacchiaretta wrote:
> Hi Raymond,
> I agree with you, 0xfffe is a special character, that is why I was asking
> how it's handled in solr.
> In my document, 0xfffe does not appear at the beginning, it's in the
> content.

I believe that 0xfffe is not a valid UTF-8 character, and its presence 
indicates something is wrong with your PostgreSQL driver, server, or the 
data in the database.  I use a UTF-8 encoded MySQL database with Solr 
and have no problems.  I've used most Solr versions between 1.4.0 and 4.4.0.

Although UTF-8 (an encoding) and Unicode (the character set it encodes) 
are not the same thing, I think that for this particular case we can 
treat them the same:

en.wikipedia.org/wiki/Specials_(Unicode_block)

Relevant excerpt:  "FFFE and FFFF are not unassigned in the usual sense, 
but guaranteed not to be a Unicode character at all. They can be used to 
guess a text's encoding scheme, since any text containing these is by 
definition not a correctly encoded Unicode text. The U+FEFF is Unicode's 
byte-order mark, named "zero-width no-break space" (as inclusion of it 
in text shall not be noticed). If this character is read in the wrong 
byte order (for example, due to an endianness bug), it will read 0xFFFE, 
which is illegal Unicode."
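
The byte-order point in that excerpt can be checked with a few lines of Python. This is only an illustrative sketch, not something taken from Solr or from the original data: it serializes U+FEFF as big-endian UTF-16 and then reads the same two bytes back with each byte order. The variable names are made up for the example.

```python
# U+FEFF (the byte-order mark) serialized as big-endian UTF-16.
bom_bytes = "\ufeff".encode("utf-16-be")  # the two bytes FE FF

# Interpret the same two bytes with each byte order.
as_big_endian = int.from_bytes(bom_bytes, "big")
as_little_endian = int.from_bytes(bom_bytes, "little")

print(hex(as_big_endian))     # 0xfeff, the intended byte-order mark
print(hex(as_little_endian))  # 0xfffe, the noncharacter from the error
```

A 0xfffe in the middle of the content, rather than at the start, would be consistent with a byte-swapped mark embedded in the text, but that is only one possible cause.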

See also the error table at the end of this Amazon Redshift documentation 
page, which DOES talk about UTF-8 rather than Unicode:

http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html
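
One way to confirm whether the source data really contains these code points is to scan each field before it is sent for indexing. The following is a minimal sketch under stated assumptions: the function names `find_noncharacters` and `strip_noncharacters` are hypothetical, the text is assumed to be already decoded into a Python string, and only U+FFFE and U+FFFF are checked. Whether stripping is acceptable depends on the data; locating the offending rows and fixing them at the source is probably the safer first step.

```python
# Code points named in the excerpt above as never being valid characters.
NONCHARACTERS = {0xFFFE, 0xFFFF}


def find_noncharacters(text):
    """Return (index, code point as hex string) for each U+FFFE/U+FFFF in text."""
    return [
        (i, hex(ord(ch)))
        for i, ch in enumerate(text)
        if ord(ch) in NONCHARACTERS
    ]


def strip_noncharacters(text):
    """Return text with every U+FFFE/U+FFFF removed."""
    return "".join(ch for ch in text if ord(ch) not in NONCHARACTERS)


if __name__ == "__main__":
    sample = "some content\ufffe with a stray code point"
    print(find_noncharacters(sample))
    print(strip_noncharacters(sample))
```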

Thanks,
Shawn

