lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Wed, 01 Sep 2010 11:51:57 GMT


Robert Muir commented on LUCENE-2369:

bq. I was thinking aggregation, but you are right. For aggregation one would of course just
use the keys and have no need for the original Strings. Then we're left with federated search.

I don't see why federated search needs anything but sort keys?

bq. That is the memory overhead. If you have 20M terms of average length 10 chars, that is
400MB in raw bytes and quite a bit more when you're taking pointers into account.

The "memory" overhead is no different from the "overhead" of regular terms; there is nothing
special about the collation-key case, which is my point (see below). And in practice, for most
people it's encoded as well under 2 bytes/char.

bq. I fail to see why that is a bad thing if we're looking at the rare scenario of having to postpone
the sorting decision to search time. What is the alternative? Right now, search-time collator-based
sorting with the field cache has low startup time, high memory usage and horrible execution time
for large result sets.

Because "search-time" collator-sorting is the wrong approach, and should not exist at all.

Indexing with collation keys once we fix LUCENE-2551 has:
* same startup time as regular terms
* approximately the same memory usage as regular terms [e.g. PRIMARY key for "Robert Muir"
is 12 bytes versus 11 bytes]
* same execution time (binary compare) as regular terms
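The binary-compare point above can be illustrated with plain `java.text.Collator` (a hedged sketch, not Lucene code; the locale and strings are illustrative): once a collation key is materialized as bytes, a simple unsigned byte-wise compare reproduces the collator's locale-aware order, so sorting on indexed keys costs no more than sorting on regular terms.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationKeyDemo {
    // Compare raw key bytes lexicographically (unsigned), the way Lucene
    // compares term bytes.
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setStrength(Collator.PRIMARY);

        String s1 = "cote", s2 = "côté";
        byte[] k1 = collator.getCollationKey(s1).toByteArray();
        byte[] k2 = collator.getCollationKey(s2).toByteArray();

        // Binary order of the keys agrees with the collator's own order;
        // at PRIMARY strength the accents are ignored, so these compare equal.
        assert Integer.signum(collator.compare(s1, s2))
            == Integer.signum(compareBytes(k1, k2));
    }
}
```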

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>                 Key: LUCENE-2369
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps
all sort terms in memory. Besides the huge memory overhead, searching requires collator-based
comparison of terms for every hit, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals
for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This
results in very low memory overhead and faster sorted searches, at the cost of increased startup-time.
As the ordinals can be resolved to terms after the sorting has been performed, this approach
supports fillFields=true.
> This issue is related to which contain
previous discussions on the subject.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

