lucene-solr-user mailing list archives

From Andrew May <a...@ingenta.com>
Subject Indexing UTF-8
Date Thu, 10 Aug 2006 15:17:08 GMT
Hi,

I'm trying to index some UTF-8 data, but I'm experiencing some problems.

I'm using the 28th July nightly build, which I believe contains all the recent fixes for 
making the administration webapp use UTF-8. I've tried running in both the provided Jetty 
instance and Tomcat 5.5.17.

I've indexed using both the post.sh script (i.e. curl) and HttpClient, with the same 
results in each case.

I'm specifically concentrating on one author name that has been causing problems:
Ayyıldız, Turhan
(I'm encoding this email as UTF-8 in the hope that it comes through OK)

What I'm seeing coming back from Solr is:
AyyÄ±ldÄ±z, Turhan
The Turkish lowercase undotted i (U+0131) is instead appearing as a Latin capital A with 
diaeresis (U+00C4) followed by a plus-minus sign (U+00B1).
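
As far as I can tell, those two characters are what you would get if the UTF-8 bytes for 
U+0131 were decoded as ISO-8859-1 somewhere along the way. That is only my assumption about 
the cause; the sketch below is plain Python to illustrate it and is not taken from Solr or 
from my indexing code:

```python
# Assumption: somewhere in the pipeline the UTF-8 bytes are decoded as ISO-8859-1.
original = "Ayy\u0131ld\u0131z, Turhan"

# U+0131 is encoded as the two bytes 0xC4 0xB1 in UTF-8.
dotless_i_bytes = "\u0131".encode("utf-8")
utf8_bytes = original.encode("utf-8")

# Decoding with the wrong charset maps each byte to one character,
# so 0xC4 0xB1 becomes U+00C4 followed by U+00B1.
garbled = utf8_bytes.decode("iso-8859-1")

# Reversing the mistake recovers the original string.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(dotless_i_bytes.hex())
print([hex(ord(c)) for c in garbled[3:5]])
print(recovered == original)
```

If that is what is happening, the data itself is intact and only the charset used to read 
it is wrong.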

Using Luke to look at the index directly the field appears as:
AyyÄ&#177;ldÄ&#177;z, Turhan
Which, assuming Luke is displaying this correctly (&#177; is ±), means something happened 
in the posting of the data or the indexing.
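
One way I could rule out the posting side would be to build the request by hand, encoding 
the body as UTF-8 explicitly and declaring the charset in the Content-Type header. The 
following is only a sketch: the URL, the field names and the build_update_request helper 
are hypothetical and not taken from my setup, and it builds the request without sending it:

```python
import urllib.request
from xml.sax.saxutils import escape

# Hypothetical values: this URL and the field names below are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(author: str) -> urllib.request.Request:
    """Build (but do not send) an update request with an explicit UTF-8 charset."""
    xml = (
        "<add><doc>"
        '<field name="id">1</field>'
        f'<field name="author">{escape(author)}</field>'
        "</doc></add>"
    )
    # Encode explicitly so the bytes on the wire do not depend on a platform default.
    body = xml.encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        # Declare the charset so the server need not guess how to decode the body.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


request = build_update_request("Ayy\u0131ld\u0131z, Turhan")
print(request.get_header("Content-type"))
# True if the body contains the UTF-8 bytes for U+0131.
print(b"\xc4\xb1" in request.data)
```

If the bytes and the header look right here and the index is still wrong, that would point 
at the server side rather than the client.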

I'm completely out of my depth when it comes to character encodings, so I don't know 
whether I'm doing something stupid, mis-configuring something, or whether this is a 
genuine problem not of my own making.

Any thoughts?

Thanks,

Andrew
