lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Wed, 26 Mar 2008 22:35:03 GMT

Yonik Seeley wrote:
> On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> Yonik Seeley <yonik@apache.org> wrote:
>>
>>>  Hmmm, can't we always do it by unicode code point?
>>>  When do we need UTF-16 order?
>>
>>  In theory, we can.  I think the sort order doesn't matter much, as
>>  long as everyone (writers & readers) agree what it is.  I think
>>  unicode code point order is more "standards compliant" too.
>>
>>  A big benefit is then we could leave things (eg TermBuffer and maybe
>>  eventually Term, FieldCache) as UTF8 bytes and save on the
>>  conversion cost when reading.
>>
>>  But I don't think Java provides a way to do this comparison?
>>  However it's not hard to implement your own:
>>
>>   http://www.icu-project.org/docs/papers/utf16_code_point_order.html
>
> Not sure I follow... you just do a byte-by-byte comparison right?  For
> ASCII, this should be slightly faster (same number of comparisons,
> less memory space and hence less cache space overall).

Sorry, you're right: if you're working with byte[] at the time, a
byte-by-byte comparison of UTF8 (treating each byte as unsigned) gives
you the same order as Unicode code point order.
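
A minimal sketch of that byte-wise comparison, for illustration only. The
class and method names are hypothetical (this is not existing Lucene code),
and it assumes both arrays hold well-formed UTF8. Java bytes are signed, so
each one is masked to 0..255 before comparing:

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteOrder {
    /** Lexicographic comparison of two UTF8 byte arrays, bytes taken as unsigned. */
    static int compareUtf8(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF; // mask: Java bytes are signed
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // One array is a prefix of the other: the shorter sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                // U+FF5E, one UTF-16 unit
        String supp = new String(Character.toChars(0x10400)); // U+10400, a surrogate pair
        // UTF-16 order (String.compareTo) puts U+FF5E after the surrogate pair...
        System.out.println(bmp.compareTo(supp) > 0);
        // ...while UTF8 byte order puts U+FF5E first, matching code point order.
        System.out.println(compareUtf8(bmp.getBytes(StandardCharsets.UTF_8),
                                       supp.getBytes(StandardCharsets.UTF_8)) < 0);
    }
}
```

The two orders only disagree when a supplementary character (encoded as a
surrogate pair in UTF-16) is compared against a BMP character at or above
U+E000.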

It's when you need to compare one String or char[] to another, or
to a UTF8 byte[], that you need that code.
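
Roughly, the String-side comparison could look like the sketch below. It
uses the code unit "fix-up" trick described in the ICU paper linked above:
surrogates are shifted above the rest of the BMP before comparing. The names
are hypothetical, and it assumes well-formed UTF-16 (no unpaired surrogates):

```java
final class CodePointOrder {
    /** Compares two UTF-16 sequences in Unicode code point order. */
    static int compare(CharSequence a, CharSequence b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x != y) {
                // Only the first differing code unit decides the order.
                return fixUp(x) - fixUp(y);
            }
        }
        return a.length() - b.length();
    }

    /** Remaps a code unit so surrogates (D800..DFFF) sort above E000..FFFF. */
    private static int fixUp(char c) {
        if (c >= 0xD800) {
            // E000..FFFF moves down to D800..F7FF; D800..DFFF moves up to F800..FFFF.
            return c >= 0xE000 ? c - 0x800 : c + 0x2000;
        }
        return c;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";
        String supp = new String(Character.toChars(0x10400));
        System.out.println(bmp.compareTo(supp) > 0);             // UTF-16 code unit order
        System.out.println(CodePointOrder.compare(bmp, supp) < 0); // code point order
    }
}
```

The extra branch per mismatch is the cost I was worried about relative to
plain String.compareTo.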

>>  But then I worried about how much slower that code is than
>>  String.compareTo, and I found a lot of places where innocent
>>  compareTo or < or > needed to be changed to this method call.
>>  Field name comparisons would have to be fixed too.  Then for
>>  backwards compatibility all of these places that do comparisons
>>  would have to fall back to the Java way when interacting with an
>>  older segment.
>
> Oh... older segments.  Yeah, I was speaking "theoretically".

Yeah.

>>  I think we can still explore this?  It just seemed way too big to
>>  glom into the already-big changes in LUCENE-510.
>
> Yeah, I was thinking of some of this more along the lines of Lucene 3.
> A term could contain a byte array instead of a String.  A String
> constructor would convert to UTF8 and then do lookups in the index
> (simple byte comparisons, no charset encoding).  A byte constructor
> for Term would also be allowed.  Things like TermEnumerators would
> keep everything in bytes, the tii would be in bytes, etc.

Yup.
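
To make the idea concrete, here is a rough sketch of what such a term might
look like. ByteTerm is a hypothetical name, not an existing Lucene class, and
field names are still compared with String.compareTo here purely for brevity:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical term whose text is held as UTF8 bytes instead of a String. */
final class ByteTerm implements Comparable<ByteTerm> {
    private final String field;
    private final byte[] bytes;

    /** String constructor: convert to UTF8 once, up front. */
    ByteTerm(String field, String text) {
        this(field, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Byte constructor: the caller already has the bytes. */
    ByteTerm(String field, byte[] bytes) {
        this.field = field;
        this.bytes = bytes.clone(); // defensive copy
    }

    /** Decodes the text back to a String (assumes the bytes are UTF8). */
    String text() {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public int compareTo(ByteTerm other) {
        int c = field.compareTo(other.field); // simplification, see note above
        if (c != 0) {
            return c;
        }
        // Simple unsigned byte comparison, no charset decoding.
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int x = bytes[i] & 0xFF;
            int y = other.bytes[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return bytes.length - other.bytes.length;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ByteTerm
            && field.equals(((ByteTerm) o).field)
            && Arrays.equals(bytes, ((ByteTerm) o).bytes);
    }

    @Override
    public int hashCode() {
        return 31 * field.hashCode() + Arrays.hashCode(bytes);
    }
}
```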

> One could also think about ways to directly index bytes too.

Right, DocumentsWriter could hold its terms in byte[] and save
time/space when terms are ASCII.

> Is it all worth it?  I really don't know.

Right, that's where I started to wonder.  It felt very much like I  
was "going against the grain of Java" as the changes started to pile  
up ...

Mike
