lucene-dev mailing list archives
From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Wed, 01 Sep 2010 12:46:54 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905011#action_12905011 ]

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

{quote}
I don't see why federated search needs anything but sort keys?
{quote}

Because the sort terms themselves are needed when a search result from your service is merged
with results from sources you do not control, by an aggregator that is not you.
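
To make that concrete, here is a rough aggregator-side sketch (the Hit class and merge method
are made up, nothing from an actual patch): hits from independent services can only be
interleaved by re-comparing the original sort terms with the aggregator's own Collator, since
binary sort keys produced by different collator implementations or versions are not comparable
across services.

{code:java}
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical aggregator-side merge: hits from two independent services can only
// be interleaved if each hit carries its original sort term, because the aggregator
// has to re-compare with its own Collator.
public class FederatedMergeSketch {

    /** Minimal stand-in for a hit returned by a remote service. */
    static final class Hit {
        final String id;
        final String sortTerm; // the original term, not a binary sort key
        Hit(String id, String sortTerm) { this.id = id; this.sortTerm = sortTerm; }
    }

    static List<Hit> merge(List<Hit> fromUs, List<Hit> fromThem, Locale locale) {
        final Collator collator = Collator.getInstance(locale);
        List<Hit> merged = new ArrayList<Hit>(fromUs.size() + fromThem.size());
        merged.addAll(fromUs);
        merged.addAll(fromThem);
        // Re-sort the combined list with the aggregator's collator; this is only
        // possible because the terms themselves were shipped along with the hits.
        Collections.sort(merged, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return collator.compare(a.sortTerm, b.sortTerm);
            }
        });
        return merged;
    }
}
{code}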

{quote}
The "memory" overhead is no different than the "overhead" of regular terms, there is nothing
special about the collation key case, this is my point (see below). and in practice for most
people, its encoded as way less than 2 bytes/char.
{quote}

We're clearly not talking about the same thing here. Maybe I've misunderstood something. Let
me try to rephrase and break it down so that we can pinpoint where the problem is.

1) Opening a Lucene index and performing a search with relevance ranking requires X bytes
of heap.
2) Performing a locale-based sorted search with Lucene 3.0.2 takes X + A bytes, where A is
relatively large, as all Terms from the sort field are kept in memory as Strings.
3) Performing the same search with Lucene trunk takes X + B bytes, where B is quite a lot
smaller than A as BytesRefs are used for the Terms kept in memory.
4) Performing the same search on an ICU key infused index with Lucene trunk + ICU magic takes
X + C bytes, where C is about the same size as B.
5) Performing the same search using pre-sorting (what I'm doing) takes X + D bytes, where
D is smaller than C. None of the Terms are kept in memory.

It seems to me that you are assuming either that the sort terms are kept in memory in case 5,
or that they are not kept in memory in any of cases 3, 4 and 5?
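
To make case 5 concrete, here is a minimal sketch (not the code attached to this issue; the
structure name is made up, and a plain int[] stands in for a packed structure): if every
document is mapped to its position in a pre-sorted list of the field's terms, ranking two
hits is a single integer comparison and no term bytes need to be resident at all.

{code:java}
// Minimal sketch of case 5: the only per-field structure held in memory is a
// docID -> ordinal map, where the ordinal is the document's position in the
// pre-sorted term list. No Strings, BytesRefs or Collator calls at search time.
public class OrdinalSortSketch {
    private final int[] docToOrd; // docID -> position in the pre-sorted term list

    public OrdinalSortSketch(int[] docToOrd) {
        this.docToOrd = docToOrd;
    }

    /** Compare two documents by sort order without touching any term bytes. */
    public int compareDocs(int docA, int docB) {
        int ordA = docToOrd[docA];
        int ordB = docToOrd[docB];
        return ordA < ordB ? -1 : (ordA == ordB ? 0 : 1);
    }
}
{code}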

{quote}
Because "search-time" collator-sorting is the wrong approach, and should not exist at all.
{quote}

Opinion noted. Circular argumentation skipped.

{quote}
Indexing with collation keys once we fix LUCENE-2551 has:

    * same startup time as regular terms
    * approximately the same memory usage as regular terms [e.g. PRIMARY key for "Robert Muir"
is 12 bytes versus 11 bytes]
    * same execution time (binary compare) as regular terms
{quote}

Indexing with pre-sorting (or whatever we call what I'm trying to do) has:

    * Huge startup time (or index commit time penalty if we move it to indexing)
    * Lower memory usage than sorting with regular terms or ICU keys
    * Faster execution time (single integer compare) than LUCENE-2551 or regular terms

Have I misunderstood something here? When a sorted search with ICU keys is performed, the
keys themselves are still compared to each other for each search, right?


Combining the two approaches and doing the pre-sorting at index time, we have

    * Time penalty at index commit (1 min / 10M terms? More? This requires real testing)
    * Faster startup time than regular term or ICU key sorting (load two PackedInts structures)
    * Lower memory usage than sorting with regular terms or ICU keys
    * Faster execution time (single integer compare) than LUCENE-2551

Bad for real-time, good for fast sorting and low memory usage.
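
For reference, here is a rough sketch of the pre-sorting step itself, assuming we already have
the field's unique terms and a docID-to-term-ordinal map (the class, method and parameter names
below are made up, and plain int[] stands in for the packed structures). The Collator is
consulted once per term at build time; after that, searches only compare integers.

{code:java}
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

// Sketch of the index-time (or startup-time) pre-sorting step. The expensive
// collator work is paid exactly once per unique term; the two resulting arrays
// are all that is needed at search time.
public class PreSortBuilderSketch {

    /** Returns { sortedTermOrds, docToSortedPos }. */
    public static int[][] build(final String[] uniqueTerms, int[] docToTermOrd, Locale locale) {
        final Collator collator = Collator.getInstance(locale);

        // 1) Sort the term ordinals by collation order (the expensive, one-off step).
        Integer[] sorted = new Integer[uniqueTerms.length];
        for (int i = 0; i < sorted.length; i++) sorted[i] = i;
        Arrays.sort(sorted, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return collator.compare(uniqueTerms[a], uniqueTerms[b]);
            }
        });

        // 2) Invert: term ordinal -> position in the sorted list.
        int[] termOrdToSortedPos = new int[uniqueTerms.length];
        for (int pos = 0; pos < sorted.length; pos++) {
            termOrdToSortedPos[sorted[pos]] = pos;
        }

        // 3) Map every document to its sorted position; searches compare these ints.
        int[] docToSortedPos = new int[docToTermOrd.length];
        for (int doc = 0; doc < docToTermOrd.length; doc++) {
            docToSortedPos[doc] = termOrdToSortedPos[docToTermOrd[doc]];
        }

        // sortedTermOrds lets positions be resolved back to terms (fillFields=true).
        int[] sortedTermOrds = new int[sorted.length];
        for (int i = 0; i < sortedTermOrds.length; i++) sortedTermOrds[i] = sorted[i];
        return new int[][] { sortedTermOrds, docToSortedPos };
    }
}
{code}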

If you agree on this breakdown, the next logical step for me is to create a performance (speed
& heap) test for the different cases, to see whether the memory savings and the alleged
faster sorting with integer comparison are enough to warrant the hassle.
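
Something along these lines is what I have in mind for the harness; sortedSearch() is just a
placeholder for whichever of the cases above is being measured, and the heap numbers from
Runtime are rough at best (a profiler would be needed for real numbers):

{code:java}
// Bare-bones speed & heap harness for comparing the sorting cases.
public class SortBenchSketch {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        long heapBefore = rt.totalMemory() - rt.freeMemory();

        long start = System.nanoTime();
        // sortedSearch();  // placeholder: run the sorted search under test here
        long elapsedMs = (System.nanoTime() - start) / 1000000;

        System.gc();
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        System.out.println("time=" + elapsedMs + " ms, heap delta="
            + ((heapAfter - heapBefore) / (1024 * 1024)) + " MB");
    }
}
{code}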

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps
all sort terms in memory. Besides the huge memory overhead, searching requires comparing
terms with collator.compare every time, making searches with millions of hits fairly expensive.
> The proposed alternative is to create a packed list of pre-sorted ordinals
for the sort terms and a map from document IDs to entries in the sorted ordinals list. This
results in very low memory overhead and faster sorted searches, at the cost of increased startup time.
As the ordinals can be resolved to terms after the sorting has been performed, this approach
supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335, which contains
previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



