lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Toke Eskildsen (JIRA)" <>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Wed, 01 Sep 2010 07:27:54 GMT


Toke Eskildsen commented on LUCENE-2369:

This doesnt make sense, why do you need the original term also?

I was thinking aggregation, but you are right. For aggregation one would of course just use
the keys and have no need for the original Strings. Then we're left with federated search.

What 'memory overhead'? indexing collation keys, even at tertiary strength (the largest size)
is in general less than 2 bytes per character. this is actually less than the cost of a term
in ram in lucene 3.1, so i don't understand this?

That is the memory overhead. If you have 20M terms of average length 10 chars, that is 400MB
in raw bytes and quite a bit more when you're taking pointers into account. When I'm talking
memory overhead, my baseline is a newly opened Lucene index without field caching of terms.
Having a beefy machine and a small index is trivial. Low-end hardware, virtualized servers
and huge indexes all calls for conserving memory. PackedInts helps a lot, avoiding storing
the terms (or ICU keys) i RAM helps more.

if you are using collation keys, then binary order gives you collated results. So thats what
I am hinting at here, is there a more general improvement here you can apply to sorting bytes?
If this issue has some ideas that can improve the more general case, I think we should look
at factoring those improvements out, and leave the locale stuff as an indexing-time thing.

My approach is in theory very simple: Provide an order mapping for each segment, merge the
order maps for index level. Having segments with the ICU keys already in sorted order would
make the segment maps 1:1 (i.e. they do not require a sort) leaving only the index level merging
of order maps. This would halve the build time and together with the 10 times speedup (guessing
from the ICU website) that ICU gives, doing the calculation for 20M terms in my example above
should take less than a minute. As the caches for the segments are then superfluous, the memory
requirements are halved and thus providing faster sort at a memory cost of 100MB as compared
to ICU sort speed and 400MB of memory.

I don't see this issue really helping if you dont know the locale at index time, by invoking
the collator over all the terms at startup you are essentially reindexing in RAM.

I fail to see why that is a bad thing if we're looking at the rare scenario of having to postpone
the sorting decision to search time. What is the alternative? Right now, search-time collator-based
sorting with field cache has low startup time, high memory usage and horrible execution time
for large results.

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>                 Key: LUCENE-2369
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps
all sort terms in memory. Beside the huge memory overhead, searching requires comparison of
terms with every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals
for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This
results in very low memory overhead and faster sorted searches, at the cost of increased startup-time.
As the ordinals can be resolved to terms after the sorting has been performed, this approach
supports fillFields=true.
> This issue is related to which contain
previous discussions on the subject.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message