lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Malformed XML with exotic characters
Date Thu, 03 Feb 2011 12:33:45 GMT
Hi

I've seen almost all funky charsets, but Gothic is always trouble. I'm also
unsure if it's really a bug in Solr. It could well be Xerces being unable
to cope. Besides, most systems indeed don't cope well with Gothic. This mail
client does, but my terminal can't find its cursor after (properly) displaying
such text.
 
http://got.wikipedia.org/wiki/%F0%90%8C%B7%F0%90%8C%B0%F0%90%8C%BF%F0%90%8C%B1%F0%90%8C%B9%F0%90%8C%B3%F0%90%8C%B0%F0%90%8C%B1%F0%90%8C%B0%F0%90%8C%BF%F0%90%8D%82%F0%90%8C%B2%F0%90%8D%83/Haubidabaurgs
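
To make the trouble concrete, here is a minimal Python sketch that decodes the
percent-encoded part of that URL and reports how each character is encoded.
The helper name `describe` is only an illustration; the escaped string is
copied from the link above.

```python
from urllib.parse import unquote

# Percent-encoded path segment copied from the Wikipedia URL above.
ENCODED = (
    "%F0%90%8C%B7%F0%90%8C%B0%F0%90%8C%BF%F0%90%8C%B1%F0%90%8C%B9"
    "%F0%90%8C%B3%F0%90%8C%B0%F0%90%8C%B1%F0%90%8C%B0%F0%90%8C%BF"
    "%F0%90%8D%82%F0%90%8C%B2%F0%90%8D%83"
)


def describe(text):
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) per character."""
    rows = []
    for ch in text:
        utf8_len = len(ch.encode("utf-8"))
        utf16_units = len(ch.encode("utf-16-be")) // 2
        rows.append((f"U+{ord(ch):04X}", utf8_len, utf16_units))
    return rows


# unquote() decodes the escapes as UTF-8 by default.
title = unquote(ENCODED)
rows = describe(title)

for code_point, utf8_len, utf16_units in rows:
    print(code_point, utf8_len, utf16_units)
```

Every character in the title is above U+FFFF, so each one takes four bytes in
UTF-8 and a surrogate pair (two code units) in UTF-16. Software that assumes
one 16-bit unit per character, or at most three UTF-8 bytes, is where things
tend to break.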

Thanks for the input.

Cheers,

On Tuesday 01 February 2011 19:59:33 Robert Muir wrote:
> Hi, it might only be a problem with your XML tools (e.g. Firefox).
> The problem here is characters outside of the Basic Multilingual Plane
> (in this case Gothic).
> XML tools typically fall apart on these portions of Unicode (in Lucene
> we recently reverted to a patched/hacked copy of Xerces specifically
> for this reason).
> 
> If you care about characters outside of the Basic Multilingual Plane
> actually working, unfortunately you have to start being very, very
> particular about what software you use... you can assume most
> software/setups WON'T work.
> For example, if you were to use MySQL's "utf8" character set you would
> find it doesn't actually support all of UTF-8! In this case you would
> need to use the recent 'utf8mb4' or something instead, which is
> actually UTF-8!
> That's just one example of a well-used piece of software that suffers
> from issues like this; there are others.
> 
> It's for reasons like these that if support for these languages is
> important to you, I would stick with the most simple/textual methods
> for input and output: e.g. using things like CSV and JSON if you can.
> I would also fully test every component/jar in your application
> individually and once you get it working, don't ever upgrade.
> 
> In any case, if you are having problems with characters outside of the
> Basic Multilingual Plane, and you suspect it's actually a bug in Solr,
> please open a JIRA issue, especially if you can provide some way to
> reproduce it.
> 
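
As a rough way to act on the advice quoted above, the following Python sketch
flags text that contains characters outside the Basic Multilingual Plane
before it is handed to a component that might not cope. The function names
are hypothetical, and the three-byte check is a stand-in for a store such as
MySQL's "utf8" character set, which holds at most three bytes per character;
it does not talk to any database.

```python
import json


def find_supplementary(text):
    """Return (index, code point) for every character above U+FFFF."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


def fits_in_three_byte_utf8(text):
    """True if every character encodes to at most three UTF-8 bytes.

    Characters up to U+FFFF need one to three bytes; anything above
    needs four, which a three-byte-limited store cannot hold.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


def survives_json_round_trip(text):
    """True if the text is unchanged after a JSON encode/decode cycle."""
    return json.loads(json.dumps(text)) == text


# Two Gothic letters (U+10337, U+10330) after an ASCII prefix.
sample = "Gothic: \U00010337\U00010330"

print(find_supplementary(sample))
print(fits_in_three_byte_utf8(sample))
print(survives_json_round_trip(sample))
```

A check like this only tells you that the input needs four-byte support; each
component in the pipeline still has to be tested on its own with such input.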
