lucene-dev mailing list archives

From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Tue, 06 Apr 2010 11:02:35 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853884#action_12853884 ]

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

The current implementation accepts a Comparator<Object> (which must accept Strings) as
well as a Locale (which is converted to Collator.getInstance(locale) under the hood) as arguments.
Plugging in the ICU collator directly should be trivial. If/when it becomes possible to use byte[]
for sorters in general, I'll add support for that.
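
A minimal sketch of the two argument styles, using only the JDK. The class and method
names (CollatorAsComparatorSketch, fromLocale, sortTerms) are made up for illustration and
are not part of the patch. It relies on java.text.Collator implementing Comparator<Object>,
which is why a Locale can be turned into the comparator form.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class CollatorAsComparatorSketch {
    // Locale form: converted to a Collator, which is itself a Comparator<Object>.
    static Comparator<Object> fromLocale(Locale locale) {
        return Collator.getInstance(locale);
    }

    // Hypothetical helper: sorts a copy of the terms with whatever comparator is plugged in.
    static List<String> sortTerms(List<String> terms, Comparator<Object> order) {
        List<String> copy = new ArrayList<>(terms);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("banana", "Cherry", "apple");
        // Collator order ignores case at the primary level, unlike String.compareTo.
        System.out.println(sortTerms(terms, fromLocale(Locale.ENGLISH)));
    }
}
```

Any other Comparator<Object> that accepts Strings could be passed to sortTerms the same way.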

Indexing ICU collator keys and using them in combination with LUCENE-2369 is an interesting
idea, as it would speed up the building process quite a lot while keeping the memory usage
down. As long as fillFields=false, the two methods are independent and should work well with
each other. This should be fairly easy to try.
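
A rough sketch of what sorting on collator keys amounts to. It uses the JDK's
java.text.Collator as a stand-in for the ICU collator, and the names
(CollationKeyBytesSketch, toSortKey, compareKeys) are hypothetical. The point is that once
a key has been produced, ordering needs only a byte-wise comparison and no further calls
to the collator.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationKeyBytesSketch {
    // Index-time step: turn a term into a byte[] key that sorts in collator order.
    static byte[] toSortKey(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    // Search-time step: compare keys as unsigned bytes, most significant byte first.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        byte[] apple = toSortKey(collator, "apple");
        byte[] banana = toSortKey(collator, "Banana");
        System.out.println(compareKeys(apple, banana) < 0);
    }
}
```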

For fillFields=true, it gets a bit trickier and requires a special FieldComparatorSource that
keeps two maps from docID: one to the ICU collator key and one to the original term. Still, it
should not be that hard to implement, and I'll be happy to do it if the fillFields=false case
turns out to work well.
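
The two-map idea could look roughly like the following. This is plain Java and
deliberately not the Lucene FieldComparatorSource API. TwoMapSortSketch and its methods
are hypothetical names, the JDK Collator stands in for ICU, and the keys are computed on
the fly here where a real implementation would read them from the index.

```java
import java.text.Collator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TwoMapSortSketch {
    private final Map<Integer, byte[]> keyByDoc = new HashMap<>();  // docID -> collator key
    private final Map<Integer, String> termByDoc = new HashMap<>(); // docID -> original term
    private final Collator collator;

    TwoMapSortSketch(Locale locale) {
        this.collator = Collator.getInstance(locale);
    }

    void add(int docId, String term) {
        keyByDoc.put(docId, collator.getCollationKey(term).toByteArray());
        termByDoc.put(docId, term);
    }

    // Ordering uses only the collator keys (unsigned byte-wise comparison).
    int compareDocs(int docA, int docB) {
        byte[] a = keyByDoc.get(docA);
        byte[] b = keyByDoc.get(docB);
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // What a fillFields=true result would report: the readable term, not the key.
    String sortValue(int docId) {
        return termByDoc.get(docId);
    }
}
```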

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps
> all sort terms in memory. Besides the huge memory overhead, searching requires comparison of
> terms with collator.compare every time, making searches with millions of hits fairly expensive.
> The proposed alternative implementation creates a packed list of pre-sorted ordinals
> for the sort terms and a map from document IDs to entries in the sorted ordinals list. This
> results in very low memory overhead and faster sorted searches, at the cost of increased startup time.
> As the ordinals can be resolved to terms after the sorting has been performed, this approach
> supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335, which contains
> previous discussions on the subject.
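
To make the ordinal scheme in the issue description concrete, here is a small sketch
under simplifying assumptions: plain int[] and String[] arrays instead of a packed
structure, one term per document, and all terms available in memory at startup.
SortedOrdinalsSketch and its members are hypothetical names, not code from the patch.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class SortedOrdinalsSketch {
    final String[] sortedTerms; // unique terms in collator order
    final int[] docToOrdinal;   // docID -> position in sortedTerms

    SortedOrdinalsSketch(String[] termByDoc, Collator collator) {
        // Startup cost: the collator is used once, to sort the unique terms.
        sortedTerms = Arrays.stream(termByDoc).distinct().toArray(String[]::new);
        Arrays.sort(sortedTerms, collator);

        Map<String, Integer> ordinalOf = new HashMap<>();
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            ordinalOf.put(sortedTerms[ord], ord);
        }
        docToOrdinal = new int[termByDoc.length];
        for (int doc = 0; doc < termByDoc.length; doc++) {
            docToOrdinal[doc] = ordinalOf.get(termByDoc[doc]);
        }
    }

    // Search-time comparison is a plain integer comparison.
    int compareDocs(int docA, int docB) {
        return Integer.compare(docToOrdinal[docA], docToOrdinal[docB]);
    }

    // Ordinals resolve back to terms after sorting, which is what fillFields=true needs.
    String termFor(int doc) {
        return sortedTerms[docToOrdinal[doc]];
    }

    public static void main(String[] args) {
        String[] terms = {"banana", "apple", "Cherry", "apple"};
        SortedOrdinalsSketch s = new SortedOrdinalsSketch(terms, Collator.getInstance(Locale.ENGLISH));
        System.out.println(Arrays.toString(s.docToOrdinal));
    }
}
```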

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.