lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Tue, 31 Aug 2010 21:37:56 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904780#action_12904780 ]

Robert Muir commented on LUCENE-2369:
-------------------------------------

bq. ICU collator keys makes sorting very fast at the cost of some extra disk space, as one
will probably want to store the original Term together with the key. It requires a non-trivial
memory overhead, in the ideal case as many bytes as there are characters in the terms. Works
extremely well with reopening.

This doesn't make sense. Why do you need the original term as well?

What 'memory overhead'? Indexed collation keys, even at tertiary strength (the largest size),
are in general less than 2 bytes per character. That is actually less than the cost of a term
in RAM in Lucene 3.1, so I don't understand this.

bq. The two approaches are not in conflict and combining them would indeed seem to give many
benefits

If you are using collation keys, then binary order gives you collated results. So what
I am hinting at here is: is there a more general improvement you can apply to sorting bytes?
If this issue has some ideas that can improve the more general case, I think we should look
at factoring those improvements out, and leave the locale stuff as an indexing-time thing.
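
To illustrate the point about binary order, here is a minimal sketch. It is not code from this issue: it uses the JDK's java.text.Collator as a stand-in for an ICU collator, and the class and method names (CollationKeyOrder, keyBytes, sortByKeyBytes) are made up for the example. It sorts terms by the unsigned byte order of their collation keys and checks that the result matches sorting with the collator directly.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeyOrder {
    // Encode a term as the raw bytes of its collation key
    // (JDK stand-in for the ICU sort key one would index).
    static byte[] keyBytes(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    // Plain unsigned, most-significant-byte-first comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Sort a copy of the terms using only the key bytes; the collator is not
    // consulted for comparisons, only for producing the keys.
    static String[] sortByKeyBytes(Collator collator, String[] terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy, (x, y) -> compareUnsigned(keyBytes(collator, x), keyBytes(collator, y)));
        return copy;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.TERTIARY);
        String[] terms = {"zebra", "Apple", "apple", "Zoo", "banana"};

        String[] byBytes = sortByKeyBytes(collator, terms);
        String[] byCollator = terms.clone();
        Arrays.sort(byCollator, collator);

        System.out.println(Arrays.toString(byBytes));
        System.out.println("same order as collator: " + Arrays.equals(byBytes, byCollator));
    }
}
```

In a real index the key bytes would be computed once at indexing time and stored as the term, so the sketch recomputes them inside the comparator only to stay short.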

bq. I agree that the sort-fields as well as sort-locale is well known at index time in most
cases.

In all cases, really. I don't see this issue helping much if you don't know the locale at
index time: by invoking the collator over all the terms at startup, you are essentially reindexing
in RAM.

If one doesn't know the necessary locales at index time, I suggest indexing with a generic UCA
collator (ULocale.ROOT) into a 'catch-all' field for all other locales.
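
A rough sketch of how the catch-all lookup might work at search time, under stated assumptions: the field naming scheme ("title_sort_" plus a language code, and "title_sort_root") and the class SortFieldChooser are hypothetical, and the set of indexed languages is assumed to be known to the application. Nothing here is Lucene or ICU API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SortFieldChooser {
    // Hypothetical name of the field indexed with the root (generic UCA) collator.
    static final String ROOT_FIELD = "title_sort_root";

    // Pick the locale-specific sort field if one was indexed for the requested
    // language; otherwise fall back to the catch-all root field.
    static String sortFieldFor(Locale requested, Set<String> indexedLanguages) {
        String language = requested.getLanguage();
        if (indexedLanguages.contains(language)) {
            return "title_sort_" + language;
        }
        return ROOT_FIELD;
    }

    public static void main(String[] args) {
        // Assumption: the index was built with dedicated sort fields for these languages.
        Set<String> indexed = new HashSet<>(Arrays.asList("da", "de"));
        System.out.println(sortFieldFor(Locale.forLanguageTag("da"), indexed));
        System.out.println(sortFieldFor(Locale.JAPANESE, indexed));
    }
}
```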


> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps all sort terms in memory. Besides the huge memory overhead, searching requires comparing terms with collator.compare every time, making searches with millions of hits fairly expensive.
> The proposed alternative implementation creates a packed list of pre-sorted ordinals for the sort terms and a map from document IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contains previous discussions on the subject.
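
The ordinal idea in the description above can be sketched roughly as follows. This is not the patch's implementation: it uses plain int arrays instead of a packed structure, assumes exactly one sort term per document, uses the JDK Collator, and all names (OrdinalSortSketch, buildDocOrdinals, sortHits) are made up for illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class OrdinalSortSketch {
    // docTerms[docId] is the sort term of that document.
    // Returns docToOrdinal, where a smaller ordinal means the term collates earlier.
    static int[] buildDocOrdinals(String[] docTerms, Collator collator) {
        String[] unique = Arrays.stream(docTerms).distinct().toArray(String[]::new);
        Arrays.sort(unique, collator); // the expensive step, paid once at startup

        Map<String, Integer> ordinalOf = new HashMap<>();
        for (int i = 0; i < unique.length; i++) {
            ordinalOf.put(unique[i], i);
        }

        int[] docToOrdinal = new int[docTerms.length];
        for (int doc = 0; doc < docTerms.length; doc++) {
            docToOrdinal[doc] = ordinalOf.get(docTerms[doc]);
        }
        return docToOrdinal;
    }

    // Sort hit document IDs by comparing ints only; no collator calls at search time.
    static Integer[] sortHits(int[] hits, int[] docToOrdinal) {
        Integer[] sorted = Arrays.stream(hits).boxed().toArray(Integer[]::new);
        Arrays.sort(sorted, (a, b) -> Integer.compare(docToOrdinal[a], docToOrdinal[b]));
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        String[] docTerms = {"pear", "apple", "banana", "apple"};
        int[] docToOrdinal = buildDocOrdinals(docTerms, collator);
        System.out.println(Arrays.toString(docToOrdinal));
        System.out.println(Arrays.toString(sortHits(new int[] {0, 1, 2, 3}, docToOrdinal)));
    }
}
```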

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

