lucene-dev mailing list archives

From Doug Cutting
Subject Re: Hacking Luke for bytecount-based strings
Date Wed, 17 May 2006 21:04:48 GMT
Marvin Humphrey wrote:
> I *think* that whether it was invalidly encoded or not wouldn't impact
> searching -- it doesn't in KinoSearch.  It should only affect display.

I think Java's approach of converting everything to Unicode internally 
is useful.  One must still handle dirty input, but it is easy to write 
output that conforms to standards.  I'd hate to lose that.
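
To illustrate the point about dirty input, here is a minimal sketch of 
decoding at the boundary.  It assumes a modern JDK (StandardCharsets is 
Java 7+); the class and method names are hypothetical and are not part 
of Lucene.  A strict decoder rejects malformed bytes, while a lenient 
one substitutes U+FFFD, so whatever is later written out is valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DirtyInputDemo {
    // Strict decode: throws CharacterCodingException on malformed UTF-8.
    static String decodeStrict(byte[] raw) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(raw))
            .toString();
    }

    // Lenient decode: each malformed sequence is replaced with U+FFFD.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] dirty = { 'a', (byte) 0xFF, 'b' };

        String cleaned = decodeLenient(dirty);
        byte[] output = cleaned.getBytes(StandardCharsets.UTF_8);

        // The re-encoded output passes the strict decoder.
        System.out.println(decodeStrict(output).equals(cleaned));

        try {
            decodeStrict(dirty);
        } catch (CharacterCodingException e) {
            System.out.println("dirty input rejected by strict decoder");
        }
    }
}
```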

Java programs have a good reputation for supporting 
internationalization, better than those written in languages that 
primarily represent strings as byte arrays and rely on library 
utilities for handling encodings and character sets.  Java's choice of 
16-bit characters may have been an error, but the general approach of 
converting all textual data to Unicode internally has led to fewer 
internationalization issues than are common in other systems.

> Detecting invalidly encoded text later doesn't help anything in and of
> itself; lifting the requirement that everything be converted to Unicode
> early on opens up some options.

How useful are those options?  Are they worth the price?  Converting to 
Unicode early permits one to, e.g., write encoding-independent 
tokenizers, stemmers, etc.  That seems like a lot to throw away.
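
As a rough sketch of what "encoding-independent" means here: the 
tokenizer below only ever sees decoded text, so the same code handles 
input that arrived as ISO-8859-1 or as UTF-8.  The class is 
hypothetical and far simpler than a real Lucene analyzer; it just 
splits on non-alphanumeric code points and lowercases.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EarlyDecodeTokenizer {
    // Works on Unicode code points; knows nothing about byte encodings.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // The encoding is dealt with once, at the boundary.
    static List<String> tokenize(byte[] raw, Charset encoding) {
        return tokenize(new String(raw, encoding));
    }

    public static void main(String[] args) {
        String text = "Caf\u00E9 au lait";  // "Café au lait"
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Different bytes on the wire, identical tokens after decoding.
        System.out.println(tokenize(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(tokenize(utf8, StandardCharsets.UTF_8));
    }
}
```

A tokenizer that worked on raw bytes would instead need to know, for 
each encoding, which byte sequences form letters.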

>> UTF-8 has the property that bytewise lexicographic order is the same
>> as Unicode character order.
> Yes.  I'm suggesting that an unpatched TermBuffer would have problems
> with my index with corrupt character data because the sort order by
> bytestring may not be the same as sort order by Unicode code point.

I think you're saying that bytewise comparisons involving invalid UTF-8 
may differ from comparisons of the Unicode code points they represent. 
But if they're invalid, they don't actually represent Unicode code 
points, so how can they be compared?
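
For valid text the ordering property quoted above can be checked 
directly.  The sketch below (hypothetical class and helper names, not 
Lucene's TermBuffer) compares two strings three ways: by unsigned UTF-8 
bytes, by code point, and with String.compareTo, which compares UTF-16 
code units.  The first two agree; the third can disagree once a 
character outside the Basic Multilingual Plane is involved.

```java
import java.nio.charset.StandardCharsets;

public class Utf8OrderDemo {
    // Unsigned bytewise lexicographic comparison.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Lexicographic comparison by Unicode code point.
    static int compareCodePoints(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca - cb;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other; the shorter sorts first.
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                 // U+FF5E
        String supp = new String(Character.toChars(0x10400));  // U+10400

        byte[] bmpUtf8 = bmp.getBytes(StandardCharsets.UTF_8);
        byte[] suppUtf8 = supp.getBytes(StandardCharsets.UTF_8);

        // Code point order and UTF-8 byte order both put U+FF5E first.
        System.out.println(compareCodePoints(bmp, supp) < 0);
        System.out.println(compareBytes(bmpUtf8, suppUtf8) < 0);

        // UTF-16 code unit order puts U+10400 first (surrogate 0xD801 < 0xFF5E).
        System.out.println(bmp.compareTo(supp) > 0);
    }
}
```

Note that compareBytes still returns an answer for malformed input; it 
is only the code point side of the comparison that has nothing to 
compare.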
