lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Wed, 01 Sep 2010 13:49:56 GMT


Robert Muir commented on LUCENE-2369:

bq. Do they or do they not need to be loaded into heap in order to be used for sorted search?

They are just regular terms! You can run a TermQuery on them, sort them as byte[], etc.
It's just that the bytes use a 'collation encoding' instead of a 'utf-8 encoding'.
This is why I want to factor the whole 'locale' thing out of the issue: sorting is
agnostic to what's in the byte[], so it's unrelated, and it would simplify the issue to just discuss sorting byte[] values.
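To make the 'collation encoding' point concrete, here is a minimal sketch using plain java.text.Collator (the class name, helper method, and sample strings are illustrative, not Lucene code): a collation key is just an opaque byte[] whose unsigned lexicographic order matches the locale's sort order, so once it is indexed it behaves like any other term.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationKeyDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        // Collation keys are opaque byte[]; comparing them byte-by-byte
        // reproduces the locale-correct order without re-running the Collator.
        byte[] cote  = collator.getCollationKey("côte").toByteArray();
        byte[] coter = collator.getCollationKey("coter").toByteArray();
        // Unsigned lexicographic byte comparison, the same way term bytes sort.
        System.out.println(compareBytes(cote, coter) < 0); // prints "true"
    }

    // Unsigned memcmp-style comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```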

bq. Easy now. The whole runtime-vs-index-time issue is something that I don't care much about
at this point. Pre-sorting can be done both at index and search time. Let's just say that
we do it at index-time and go from there.

Well, the thing is, it's something I care a lot about. The problems are:
* Users who develop localized applications tend to use methods with Locale/Collator parameters
when they are available: it's best practice.
* In the case of Lucene, it is not best practice but a silly trap, because you get horrible performance.
* However, users are used to the concept of collation keys with respect to indexing (e.g. when building
a database index).
* The APIs here are wrong anyway: they should take a Collator, not a Locale.
There is no way to set strength or any other options, and there's no way to supply a Collator
I made myself (e.g. from RuleBasedCollator).
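To illustrate the last point, here is a small sketch (illustrative class name and a toy rule string) of options that are only expressible through a Collator instance and cannot be conveyed by a Locale parameter alone:

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollatorOptionsDemo {
    public static void main(String[] args) throws ParseException {
        // Strength is only settable on a Collator instance; a Locale-only
        // API gives the caller no way to express it.
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.PRIMARY); // ignore case and accent differences
        System.out.println(collator.compare("resume", "Résumé") == 0); // prints "true"

        // Likewise, a hand-built RuleBasedCollator (toy rule set for illustration)
        // cannot be passed through a Locale parameter at all.
        RuleBasedCollator custom = new RuleBasedCollator("< a < b < c < d");
        System.out.println(custom.compare("b", "c") < 0); // prints "true"
    }
}
```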

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>                 Key: LUCENE-2369
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps
all sort terms in memory. Besides the huge memory overhead, searching requires comparing
terms with a Collator every time, making searches with millions of hits fairly expensive.
> The proposed alternative implementation is to create a packed list of pre-sorted ordinals
for the sort terms and a map from document IDs to entries in the sorted-ordinals list. This
results in very low memory overhead and faster sorted searches, at the cost of increased startup time.
As the ordinals can be resolved to terms after the sorting has been performed, this approach
supports fillFields=true.
> This issue is related to earlier issues that contain
previous discussions on the subject.
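The ordinal scheme quoted above can be sketched in a few lines (toy data and names; the real implementation would work per index segment with packed integer structures): sorting compares small ints instead of terms, and ordinals are resolved back to terms only after the sort.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

public class OrdinalSortSketch {
    public static void main(String[] args) {
        // Hypothetical per-document sort terms. The ordinal trick is agnostic
        // to what the bytes mean (collation-encoded or not).
        String[] docTerms = {"pear", "apple", "apple", "cherry"};

        // 1. Build the sorted list of distinct terms once, at startup.
        String[] sorted = new TreeSet<>(Arrays.asList(docTerms)).toArray(new String[0]);

        // 2. Map each document to the ordinal of its term in that list.
        int[] docToOrd = new int[docTerms.length];
        for (int doc = 0; doc < docTerms.length; doc++) {
            docToOrd[doc] = Arrays.binarySearch(sorted, docTerms[doc]);
        }

        // 3. Sorting hits now compares small ints, never strings or Collators.
        Integer[] docs = {0, 1, 2, 3};
        Arrays.sort(docs, Comparator.comparingInt(d -> docToOrd[d]));

        // 4. Resolve ordinals back to terms after sorting (fillFields=true).
        for (int doc : docs) {
            System.out.println(doc + " -> " + sorted[docToOrd[doc]]);
        }
        // prints: 1 -> apple, 2 -> apple, 3 -> cherry, 0 -> pear
    }
}
```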

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

