lucene-dev mailing list archives

From "Toke Eskildsen (JIRA)" <>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Thu, 02 Sep 2010 13:49:55 GMT


Toke Eskildsen commented on LUCENE-2369:

> They are just regular terms! You can do a TermQuery on them, sort them as byte[], etc.
> It's just that the bytes use 'collation encoding' instead of 'UTF-8 encoding'.

Yes, you have stated that repeatedly.

As you are unwilling to answer my questions, I've tweaked my tests to see for myself. That
worked out well, as I learned something new. To answer one of the questions: yes, the terms
are loaded into memory when a sort is performed on a field. This is the case for the locale-based
sort (yes, I have understood that you do not consider that a viable form of sorting; I
only mention it for completeness), for STRING sorting (natural order with ordinal-based speedup)
and for STRING_VAL sorting (natural order without the ordinal speedup). All this against Lucene
trunk, no patches.

> This is why I want to factor out the whole 'locale' thing from the issue: since sorting is
> agnostic to what's in the byte[], it's unrelated, and it would simplify the issue to just discuss

For my new tests I've switched to natural order sorting (direct byte[] comparison, aka Lucene
STRING sorted search without specifying a Locale). The tests should be fairly telling for
the different scenarios, as the ICU keys should be about the same size as the original terms.
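For clarity, "natural order" here just means unsigned lexicographic comparison of the raw term bytes. A minimal sketch of that comparison (the class and method names are mine, not Lucene's):

```java
// Unsigned lexicographic byte[] comparison -- the "natural order" used
// when sorting terms as raw bytes, whether collation-encoded or UTF-8.
public final class NaturalOrder {
    public static int compareBytes(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            // Mask to 0..255 so bytes compare as unsigned values,
            // matching the on-disk term order.
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // a shorter prefix sorts first
    }
}
```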

    * The APIs here are wrong anyway: it shouldn't take Locale but Collator.
      There is no way to set strength or any other options, and there's no way to supply a
      Collator I made myself (e.g. from RuleBasedCollator)

I fully agree. The code I've made so far takes a Comparator<BytesRef>, with an optimization
for wrapped Collators. The title of LUCENE-2369 is not technically correct, but was chosen
because "Locale" is a fairly well-known concept while "Collator" is more complex. That might have
been a mistake.
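The Comparator-based approach could look roughly like this. This is my own sketch, not the patch's code: it uses plain byte[] instead of BytesRef to stay self-contained, and it decodes on every comparison where a real implementation would pre-compute CollationKeys:

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Sketch of a byte[] comparator backed by a java.text.Collator.
// Any Collator can be supplied, including a custom RuleBasedCollator,
// which addresses the "no way to set strength or supply my own
// Collator" complaint.
public final class CollatorBytesComparator implements Comparator<byte[]> {
    private final Collator collator;

    public CollatorBytesComparator(Collator collator) {
        this.collator = collator;
    }

    @Override
    public int compare(byte[] a, byte[] b) {
        // Decode the UTF-8 term bytes and delegate to the Collator.
        // This repeated decoding is exactly what makes per-comparison
        // collation expensive; caching CollationKeys would avoid it.
        return collator.compare(new String(a, StandardCharsets.UTF_8),
                                new String(b, StandardCharsets.UTF_8));
    }

    public static Comparator<byte[]> forLocale(Locale locale) {
        return new CollatorBytesComparator(Collator.getInstance(locale));
    }
}
```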

Onwards to testing with natural order (new Sort(new SortField(myField, SortField.STRING))).
No ZIPping in the background this time, so measurements will differ from the previous test. Heap
sizes were measured after a call to System.gc().
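The heap numbers below were sampled roughly like this (my own sketch; note that System.gc() is only a hint to the JVM, so such measurements are approximate):

```java
// Approximate used-heap measurement: request a GC, then read the
// difference between total and free heap from the Runtime.
public final class HeapMeter {
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // only a hint -- the JVM may not collect everything
        return rt.totalMemory() - rt.freeMemory();
    }
}
```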

2M document index, search hits 1M documents, top 10 hits extracted:

    * Initial exposed search: 0:16 minutes
    * Subsequent exposed searches: 40 ms
    * Total heap usage for Lucene + exposed structure: 20 MB
    * Initial default Lucene search: 0.8 s
    * Subsequent default Lucene searches: 25 ms
    * Total heap usage for Lucene + field cache: 60 MB

20M document index, search hits 10M documents, top 10 hits extracted:

    * Initial exposed search: 2:53 minutes
    * Subsequent exposed searches: 330 ms
    * Total heap usage for Lucene + exposed structure: 154 MB
    * Initial default Lucene search: 6 s
    * Subsequent default Lucene searches: 220 ms
    * Total heap usage for Lucene + field cache: 600 MB

200M document index, search hits 100M documents, top 10 hits extracted:

    * Initial exposed search: 31:33 minutes
    * Subsequent exposed searches: 3200 ms
    * Total heap usage for Lucene + exposed structure: 1660 MB
    * No data for the default Lucene search, as it hit OutOfMemoryError with 7 GB of heap.

What did we learn from this?

    * Natural order search in Lucene with STRING is very fast (as most of the work is done on ordinals).
    * Exposed sorting is actually slower than natural order (that's news to me). The culprit
is a modified PackedInts structure. I'll look into that.
    * The exposed structure build penalty for the hybrid approach (i.e. relying on natural
order instead of doing an explicit sort) was indeed markedly lower than exposed with explicit
sorting: a factor of 5. I would have expected it to be more, though.
    * The hybrid approach uses less than a third of the RAM required by Lucene natural
order sorting.
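The RAM saving comes from comparing small packed ordinals instead of the terms themselves. A toy sketch of the idea (all names are mine; a real implementation would use a packed-ints structure rather than plain int arrays):

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy sketch of ordinal-based sorting: terms are sorted once, each
// document keeps only the ordinal of its term, and every comparison
// at search time is an int comparison -- no term bytes needed.
public final class OrdinalSort {
    public static Integer[] sortDocsByTerm(String[] termPerDoc) {
        // Build the sorted term list and a doc -> ordinal map.
        String[] sorted = termPerDoc.clone();
        Arrays.sort(sorted); // one-time natural-order sort of the terms
        int[] docToOrd = new int[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            docToOrd[doc] = Arrays.binarySearch(sorted, termPerDoc[doc]);
        }
        // Sorting documents now only touches the ordinals.
        Integer[] docs = new Integer[termPerDoc.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> docToOrd[d]));
        return docs;
    }
}
```

The ordinals can be resolved back to terms afterwards, which is why the approach can still support fillFields=true.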

So, Robert, does this answer your challenge "if you have a way to improve memory usage for
byte[] terms, lets look just at that"?

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>                 Key: LUCENE-2369
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>         Attachments: LUCENE-2369.patch
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps
> all sort terms in memory. Besides the huge memory overhead, searching requires comparison of
> terms every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals
> for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This
> results in very low memory overhead and faster sorted searches, at the cost of increased startup-time.
> As the ordinals can be resolved to terms after the sorting has been performed, this approach
> supports fillFields=true.
> This issue is related to which contain previous discussions on the subject.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
