lucene-dev mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject RE: Sorting with little memory: A suggestion
Date Fri, 19 Mar 2010 22:07:26 GMT
From: Robert Muir [rcmuir@gmail.com]:
> Right, JDK collation sucks, use the ICU for collation keys too:
> http://site.icu-project.org/charts/collation-icu4j-sun
> at 1.59 bytes/char, thats less than UTF-16

Ah... I should have seen that. It does not change the overall picture though: although the
ICU collation keys are impressively small, they still take up nearly as much space as the
original Strings when they themselves are represented as Strings. Thus the collation keys
do not help memory usage (much).

When they are stored as bytes, it helps significantly, but even then there's still a huge
difference between holding them in memory and using an array of positions. Even with optimal
storage (each collation key takes up exactly the number of bytes it contains), an index of
10M documents with 10M unique terms of length 20 in a sort field would use about 300MB for
a given locale vs. the 10M*log2(10M)/8 = 27MB for a compressed order array.
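
To make the arithmetic above easy to re-run with other numbers, here is a rough sketch in
Java. It is not code from Lucene: the class and method names are made up for illustration,
the 1.59 bytes/char figure is the one quoted from the ICU chart, and the order array is
assumed to be bit-packed using a whole number of bits per position (so it rounds
log2(10M) up to 24 instead of using the fractional value).

```java
// Back-of-the-envelope sketch; class and method names are hypothetical.
final class SortMemoryEstimate {

    /** Bytes needed to hold every term's collation key with zero per-key overhead. */
    static long collationKeyBytes(long uniqueTerms, int charsPerTerm, double keyBytesPerChar) {
        return Math.round(uniqueTerms * charsPerTerm * keyBytesPerChar);
    }

    /** Whole bits needed to store one position in the range [0, entries). */
    static int bitsPerPosition(long entries) {
        if (entries <= 1) {
            return 1;
        }
        return 64 - Long.numberOfLeadingZeros(entries - 1);
    }

    /** Bytes for a bit-packed order array holding one position per document. */
    static long packedOrderBytes(long docs) {
        return (docs * bitsPerPosition(docs) + 7) / 8;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        System.out.println("collation keys: " + collationKeyBytes(docs, 20, 1.59) + " bytes");
        System.out.println("order array:    " + packedOrderBytes(docs) + " bytes");
    }
}
```

With these assumptions the collation keys come to 318,000,000 bytes and the order array to
30,000,000 bytes (about 28.6 MiB). Using the fractional log2 as in the formula above gives
roughly 29 million bytes, or about 27.7 MiB. Either way the gap is an order of magnitude.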

Still, depending on how little space a byte-array will take in flex, using the indexed collator
key approach might turn out to be the best choice in a lot of cases as it works really well
for incremental updates.

Regards,
Toke Eskildsen 