lucene-java-user mailing list archives

From Peyman Faratin <pey...@robustlinks.com>
Subject luke and chinese text
Date Thu, 22 Dec 2011 12:50:18 GMT
Hi 

We are indexing some Chinese text (using the following OutputStreamWriter with UTF-8 encoding).


OutputStreamWriter outputFileWriter = new OutputStreamWriter(new FileOutputStream(outputFile), "utf8");

using Lucene 3.2. The analyzer is:

new LimitTokenCountAnalyzer(new SmartChineseAnalyzer(Version.LUCENE_32,Stopwords),Integer.MAX_VALUE)
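For reference, here is a minimal sketch of how "?" characters can appear at the writer stage. It is an illustration under assumptions, not a diagnosis of our index: the class name EncodingRoundTrip and the helper roundTrip are hypothetical, and the US-ASCII case only stands in for any charset that cannot represent Chinese characters. It uses only standard JDK classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of Lucene or Luke.
class EncodingRoundTrip {

    // Writes text through an OutputStreamWriter using writeCharset,
    // then decodes the resulting bytes using readCharset.
    static String roundTrip(String text, String writeCharset, String readCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            Writer writer = new OutputStreamWriter(bytes, Charset.forName(writeCharset));
            writer.write(text);
            writer.close(); // flushes the encoder into the byte buffer
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
        return new String(bytes.toByteArray(), Charset.forName(readCharset));
    }

    public static void main(String[] args) {
        String sample = "\u4e2d\u6587"; // the two characters of "Chinese (language)"

        // UTF-8 in, UTF-8 out: the text survives unchanged.
        System.out.println(sample.equals(roundTrip(sample, "UTF-8", "UTF-8"))); // prints "true"

        // A charset that cannot encode the characters: the writer substitutes '?'.
        System.out.println(roundTrip(sample, "US-ASCII", "US-ASCII")); // prints "??"
    }
}
```

If the UTF-8 round trip is clean for our data, the writer is probably not where the characters are lost, and the reading side (or the display in Luke) becomes the more likely place to look.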

We are now trying to inspect the index in Luke 3.4.0 (with the UTF-8 option selected in Luke),
but the terms appear garbled: we see a lot of "???". Judging from
http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/decoders/StringDecoder.java
the issue should be in:

  public String decodeTerm(String fieldName, Object value) {
    if (value == null) {
      return "(null)";
    } else if (value instanceof BytesRef) {
      return ((BytesRef)value).utf8ToString();
    } else {
      return value.toString();
    }
  }

In this function, value should be an instance of BytesRef, so that calling
.utf8ToString() decodes the BytesRef's UTF-8 bytes into a Java String. However, for some
unknown reason, with our index the value is not a BytesRef; I also checked that it is not a
CharsRef. So the toString() method is called on the value object, and the result is a run of
"?" characters.

BytesRef and CharsRef are classes defined by Lucene, so to debug this further we may need to
dig into the Lucene code. We do not yet know the real runtime type of value. If that type
does not override toString(), then value.toString() falls back to the default java.lang.Object
implementation, which returns the class name followed by "@" and the hash code in hex; in the
Eclipse debugger I saw that the hash code is 0.
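One way to narrow this down might be to log the runtime class of value together with its code points, which would show whether the term really contains literal "?" characters (U+003F) or valid Chinese characters that are only displayed badly. The sketch below is an assumption-laden debugging aid: TermInspector and describe are hypothetical names, not part of Luke or Lucene, and the code relies only on the JDK.

```java
// Hypothetical debugging helper, not part of Luke or Lucene.
class TermInspector {

    // Returns the runtime class name of value and, when it is text,
    // the Unicode code points it contains.
    static String describe(Object value) {
        if (value == null) {
            return "(null)";
        }
        StringBuilder sb = new StringBuilder(value.getClass().getName());
        if (value instanceof CharSequence) {
            String s = value.toString();
            sb.append(':');
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                sb.append(String.format(" U+%04X", cp));
                i += Character.charCount(cp); // step over surrogate pairs correctly
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Real Chinese characters in the term:
        System.out.println(describe("\u4e2d\u6587")); // prints "java.lang.String: U+4E2D U+6587"
        // Characters already replaced by '?' before they reached the index:
        System.out.println(describe("??"));           // prints "java.lang.String: U+003F U+003F"
    }
}
```

If the output shows U+003F, the characters were probably lost before or during indexing; if it shows code points in the CJK range, the index content is likely fine and the problem is more likely in how the text is rendered.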


Any advice would be appreciated

thank you

Peyman

