lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: luke and chinese text
Date Thu, 22 Dec 2011 13:04:20 GMT
On 22/12/2011 13:50, Peyman Faratin wrote:
> Hi
>
> We are indexing some Chinese text (written with the following OutputStreamWriter, using
> UTF-8 encoding).
>
> OutputStreamWriter outputFileWriter = new OutputStreamWriter(new FileOutputStream(outputFile), "utf8");
>
> using Lucene 3.2. The analyzer is
>
> new LimitTokenCountAnalyzer(new SmartChineseAnalyzer(Version.LUCENE_32,Stopwords),Integer.MAX_VALUE)
>
> We are now trying to inspect the index in Luke 3.4.0 (with the UTF-8 option selected in
> Luke), but the text appears garbled: we see a lot of "???". According to
> http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/decoders/StringDecoder.java
> the issue should be in
>
>    public String decodeTerm(String fieldName, Object value) {
>      if (value == null) {
>        return "(null)";
>      } else if (value instanceof BytesRef) {
>        return ((BytesRef)value).utf8ToString();
>      } else {
>        return value.toString();
>      }
>    }
>
> In this function, the value should be an instance of BytesRef, so calling
> .utf8ToString() would decode the UTF-8 bytes in the BytesRef into a Java String. However,
> for some unknown reason, the value in our index is not a BytesRef; I also checked that it
> is not a CharsRef.

Hmm, then what is it? Just add a println(value.getClass().getName()) and 
see what it is.
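
For example, something along these lines could be dropped in temporarily. This is only a rough sketch: the class TermDebug and the method describe() are made-up names, not part of Luke. Besides the class name it also prints the code points of value.toString(), which should show whether the string really contains '?' (U+003F) or proper CJK characters that merely fail to render.

```java
class TermDebug {
    // Hypothetical helper (not part of Luke): reports the runtime class of a
    // term value followed by the Unicode code points of its toString() form.
    static String describe(Object value) {
        if (value == null) {
            return "(null)";
        }
        StringBuilder sb = new StringBuilder(value.getClass().getName()).append(':');
        String s = value.toString();
        // Walk by code point so that surrogate pairs are reported as one unit.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(String.format(" U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Real Chinese characters: prints "java.lang.String: U+4E2D U+6587"
        System.out.println(describe("\u4e2d\u6587"));
        // Literal question marks: prints "java.lang.String: U+003F U+003F"
        System.out.println(describe("??"));
    }
}
```

If the output shows U+003F, the '?' characters are in the data itself; if it shows code points in the CJK range, the data is probably fine and the problem is on the display side.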


> So the toString() method is called on the value object, and the result is a string of "???".

I suspect that the issue could be with the display font. Please select 
a font that supports the Unicode characters you need from the Settings 
menu. The default platform font often doesn't support them, which 
results in '?' or other strange characters.
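
As a rough way to check font coverage outside Luke, a sketch like the following lists the code points of a sample string that a given font cannot display. FontCoverage and missing() are made-up names, and "Dialog" is just an example logical font; I am assuming here that the fonts Java reports on your platform behave the same way inside Luke.

```java
import java.awt.Font;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class FontCoverage {
    // Hypothetical helper: returns the code points in text that the given
    // predicate rejects, formatted as U+XXXX.
    static List<String> missing(String text, IntPredicate canDisplay) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!canDisplay.test(cp)) {
                out.add(String.format("U+%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // Font.DIALOG is a logical font name, used only as an example.
        Font font = new Font(Font.DIALOG, Font.PLAIN, 12);
        String sample = "\u4e2d\u6587"; // two common Chinese characters
        // Font.canDisplay(int) reports whether the font has a glyph for a code point.
        System.out.println(font.getFontName() + " cannot display: "
                + missing(sample, cp -> font.canDisplay(cp)));
    }
}
```

An empty list suggests the font has glyphs for the sample text; the actual result depends on the fonts installed on the machine.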


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

