lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: Posting unicode data to lucene not working during searching/retreival!
Date Thu, 21 May 2009 06:33:50 GMT
Indexed data comes out exactly as it was put in. Lucene works with Java
Strings, so encoding is irrelevant inside the index. When you index your
values, you must make sure to construct your strings/char arrays correctly
from the UTF-8 bytes (e.g. by using a standard Java Reader with an explicit
charset, or new String(byte[], charset)). When you then print stored fields
you must do the same in the other direction. So the general rule: always
specify the correct charset when converting between strings and bytes.
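
As a rough illustration of that rule, here is a minimal sketch using only the
JDK (no Lucene calls). The class and method names are made up for this
example, and StandardCharsets assumes Java 7 or later; on older JVMs you
would pass the charset name "UTF-8" instead. It also shows one way the "?"
marks can appear: encoding with a charset that cannot represent the
characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class CharsetRoundTrip {

    // Bytes -> String: name the charset explicitly instead of relying
    // on the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // String -> bytes: same rule in the other direction.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Round-trips the text through US-ASCII. Characters that ASCII
    // cannot represent are replaced with '?' during encoding.
    static String roundTripAscii(String text) {
        byte[] lossy = text.getBytes(StandardCharsets.US_ASCII);
        return new String(lossy, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Six Devanagari characters, written as escapes so the source
        // file's own encoding does not matter.
        String original = "\u0939\u093F\u0928\u094D\u0926\u0940";

        String viaUtf8 = decodeUtf8(encodeUtf8(original));
        System.out.println(original.equals(viaUtf8)); // prints "true"

        System.out.println(roundTripAscii(original)); // prints "??????"
    }
}
```

The String that comes back from decodeUtf8 is what you would hand to Lucene
as a field value. Note that the console or HTTP response you print to also
has its own encoding, so the last step of the output path needs the same
care.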
For searching: it also depends on the Analyzer used during indexing and
searching. Analyzers written for specific languages often cannot correctly
handle characters from other languages. But e.g. StandardAnalyzer or
WhitespaceAnalyzer do not modify the tokens in any way (provided that
lowercasing them is not a problem).

Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen

> -----Original Message-----
> From: KK []
> Sent: Thursday, May 21, 2009 3:25 PM
> To:
> Subject: Posting unicode data to lucene not working during
> searching/retreival!
> How do I post UTF-8 encoded Unicode data to a Lucene index? Do we have to
> specify something special, some sort of flag saying that we're posting
> Unicode data?
> I tried to post some UTF-8 encoded data, but during retrieval I'm not able
> to see the data; there are just "?" marks in all those places. Earlier I
> was using Solr, posting with the same method, and retrieval was working
> fine, but I don't know what the issue is with Lucene. Maybe I'm missing
> something. Can someone tell me what could be the issue? Thank you.
> KK
