lucene-dev mailing list archives

From: Marvin Humphrey <>
Subject: Re: Hacking Luke for bytecount-based strings
Date: Wed, 17 May 2006 20:33:22 GMT

On May 17, 2006, at 11:08 AM, Doug Cutting wrote:

> Marvin Humphrey wrote:
>> What I'd like to do is augment my existing patch by making it   
>> possible to specify a particular encoding, both for Lucene and Luke.
> What ensures that all documents in fact use the same encoding?

In KinoSearch at this moment, zilch.  Lucene would still need to read
stuff into Java chars and then write it out using the specified
encoding.  If we opt for output buffering rather than output counting
(the patch currently does counting, but that would have to change if
we're flexible about encoding in the index), then
String.getBytes(encoding) would guarantee it.
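
Roughly what I have in mind for the buffering approach, as a minimal
sketch -- DataOutputStream stands in for Lucene's IndexOutput here, and
the class and method names are only illustrative, not part of the patch:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class BufferedStringOutput {
      // Buffer the encoded bytes first, so the length written ahead of
      // the payload is guaranteed to match it, whatever the encoding.
      public static void writeString(DataOutputStream out, String text,
                                     String encoding) throws IOException {
        byte[] bytes = text.getBytes(encoding); // e.g. "UTF-8", "EUC-JP"
        out.writeInt(bytes.length);             // byte count, not char count
        out.write(bytes);
      }

      public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeString(new DataOutputStream(buf), "日本語テキスト", "UTF-8");
        System.out.println(buf.size());         // 4 (length) + 21 (payload)
      }
    }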

> The current approach of converting everything to Unicode and then  
> writing UTF-8 to indexes makes indexes portable and simplifies the  
> construction of search user interfaces, since only indexing code  
> needs to know about other character sets and encodings.

Sure.  OTOH, it's not so good for CJK users.  I also opted against it  
in KinoSearch because A) compatibility with the current Java Lucene  
file format wasn't going to happen anyway, and B) not all Perlers use  
or require valid UTF-8.  I've considered adding a UTF8Enforcer  
Analyzer subclass, but it hasn't been an issue.  Right now, if your  
source docs are mucked up, they'll be mucked up when you retrieve  
them after searching.  If you want to fix that, you preprocess.   
Ensuring consistent encoding is the application developer's
responsibility.

> If a collection has invalidly encoded text, how does it help to  
> detect that later rather than sooner?

I *think* that whether it was invalidly encoded or not wouldn't  
impact searching -- it doesn't in KinoSearch.  It should only affect  
display.  Detecting invalidly encoded text later doesn't help  
anything in and of itself; lifting the requirement that everything be  
converted to Unicode early on opens up some options.

>> Searches will continue to work regardless because the patched
>> TermBuffer compares raw bytes. (A comparison based on
>> Term.compareTo() would likely fail because raw bytes translated
>> to UTF-8 may not produce the same results.)
> UTF-8 has the property that bytewise lexicographic order is the  
> same as Unicode character order.

Yes.  I'm suggesting that an unpatched TermBuffer would have problems
with an index containing corrupt character data, because sort order by
bytestring may not be the same as sort order by Unicode code point.
However, the patched TermBuffer uses compareBytes() rather than
compareChars(), so TermInfosReader should work fine.

Marvin Humphrey
Rectangular Research

    public final int compareTo(TermBuffer other) {
      if (field == other.field)			  // fields are interned
-       return compareChars(text, textLength, other.text, other.textLength);
+       return compareBytes(bytes, bytesLength, other.bytes, other.bytesLength);
      else
        return field.compareTo(other.field);
    }

-  private static final int compareChars(char[] v1, int len1,
-                                        char[] v2, int len2) {
+  private static final int compareBytes(byte[] bytes1, int len1,
+                                        byte[] bytes2, int len2) {
      int end = Math.min(len1, len2);
      for (int k = 0; k < end; k++) {
-      char c1 = v1[k];
-      char c2 = v2[k];
-      if (c1 != c2) {
-        return c1 - c2;
+      int b1 = (bytes1[k] & 0xFF);
+      int b2 = (bytes2[k] & 0xFF);
+      if (b1 != b2) {
+        return b1 - b2;
        }
      }
      return len1 - len2;
    }