lucene-dev mailing list archives

From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Thu, 23 Sep 2010 13:48:35 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Toke Eskildsen updated LUCENE-2369:
-----------------------------------

    Attachment: LUCENE-2369.patch

This patch keeps the code in sync with Lucene trunk (20100923) and explores some ideas. Besides
the updated code with some bug fixes and optimizations, there's sample code for faceting
and index lookup (check out the unit test TestExposedFacets.testScale). I know that this does
not belong in Lucene core, so see it as a demonstration of the potential in providing the
doc/term mappings.

Now, revisiting the previous test with the updated code, and this time actually remembering
not to do an explicit sort in the exposed part (simulating that ICU collator keys are indexed),
the numbers are:

2M document index, search hits 1M documents, top 10 hits extracted:
  * Opening the index and doing a plain relevance-sorted search: 3 MB
  * Initial exposed search: 3.5 seconds
  * Subsequent exposed searches: 40-60 ms
  * Total heap usage for Lucene + exposed structure: 23 MB
  * Initial default Lucene sorted search: 1.0 seconds
  * Subsequent default Lucene searches: 30-35 ms
  * Total heap usage for Lucene + field cache: 61 MB

20M document index, search hits 10M documents, top 10 hits extracted:
  * Opening the index and doing a plain relevance-sorted search: 27 MB
  * Initial exposed search: 44 seconds
  * Subsequent exposed searches: 350-380 ms
  * Total heap usage for Lucene + exposed structure: 183 MB
  * Initial default Lucene sorted search: 6.7 seconds
  * Subsequent default Lucene searches: 220-240 ms
  * Total heap usage for Lucene + field cache: 614 MB

200M document index, search hits 100M documents, top 10 hits extracted:
  * Opening the index and doing a plain relevance-sorted search: 210 MB
  * Initial exposed search: 7:35 minutes
  * Subsequent exposed searches: 3320-3550 ms
  * Total heap usage for Lucene + exposed structure: 1744 MB
  * No data for default Lucene search as there was OOM with 7 GB of heap.
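
The heap figures above can be turned into a rough per-document cost for each sort structure. The following is only a back-of-the-envelope sketch: it assumes that the "plain relevance-sorted search" figure is the baseline to subtract and that 1 MB means 1024 * 1024 bytes, neither of which is stated in the measurements. The class and method names are made up for illustration.

```java
// Rough per-document heap cost of each sort structure, using the figures above.
// Assumptions (not stated in the measurements): the plain relevance-sorted
// search figure is the baseline, and 1 MB = 1024 * 1024 bytes.
class HeapOverhead {
    static double bytesPerDoc(long totalMb, long baselineMb, long docs) {
        return (totalMb - baselineMb) * 1024.0 * 1024.0 / docs;
    }

    public static void main(String[] args) {
        System.out.printf("2M   exposed:     %.1f bytes/doc%n", bytesPerDoc(23, 3, 2_000_000L));
        System.out.printf("2M   field cache: %.1f bytes/doc%n", bytesPerDoc(61, 3, 2_000_000L));
        System.out.printf("20M  exposed:     %.1f bytes/doc%n", bytesPerDoc(183, 27, 20_000_000L));
        System.out.printf("20M  field cache: %.1f bytes/doc%n", bytesPerDoc(614, 27, 20_000_000L));
        System.out.printf("200M exposed:     %.1f bytes/doc%n", bytesPerDoc(1744, 210, 200_000_000L));
    }
}
```

Under those assumptions the exposed structure comes out at roughly 8 to 10 bytes per document, against about 30 bytes per document for the field cache.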

While the time for the first search is still substantial, it is a lot shorter than in the previous
measurements. Lucene natural order sorting is still nearly twice as fast (I haven't tried
switching to int[] instead of PackedInts yet, so that part is not closed). I'll try to find
the time to do some more detailed tests with a more realistic number of hits, but I estimate
that the speed will be the same, relative to Lucene natural order sort.
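
To illustrate the int[] versus packed trade-off mentioned above, here is a minimal sketch of fixed-width bit packing. It is not Lucene's PackedInts and not the code in the patch; SimplePacked and its methods are hypothetical names. The point is only that a packed structure stores each value in just the bits it needs, at the price of shifts and masks on every read, where an int[] always spends 32 bits but reads with a single array access.

```java
// Hypothetical fixed-width bit packing; illustrative only, not Lucene's PackedInts.
final class SimplePacked {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    SimplePacked(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 63) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..63");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) valueCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    // Smallest number of bits that can hold every value in 0..maxValue.
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int bitsInFirst = 64 - shift;
        if (bitsInFirst < bitsPerValue) {
            // The value straddles two blocks: fetch the high bits from the next one.
            value |= blocks[block + 1] << bitsInFirst;
        }
        return value & mask;
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = value & mask;
        blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
        int bitsInFirst = 64 - shift;
        if (bitsInFirst < bitsPerValue) {
            // Write the remaining high bits into the low end of the next block.
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> bitsInFirst)) | (v >>> bitsInFirst);
        }
    }
}
```

With this sketch, ordinals up to 20 million need bitsRequired(20_000_000L) = 25 bits each rather than 32, which is where the memory saving comes from; the extra work in get() is the kind of cost that switching to int[] would avoid.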

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>         Attachments: LUCENE-2369.patch, LUCENE-2369.patch
>
>
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps
> all sort terms in memory. Besides the huge memory overhead, searching requires comparison of
> terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals
> for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This
> results in very low memory overhead and faster sorted searches, at the cost of increased startup time.
> As the ordinals can be resolved to terms after the sorting has been performed, this approach
> supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335, which contains
> previous discussions on the subject.
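
The idea in the description above can be sketched in a few lines of plain Java. This is not the code from the patch and uses no Lucene classes: OrdinalSortSketch, topDocs and termFor are hypothetical names, the terms are held as strings in memory for brevity (the proposal keeps packed ordinals instead), and a full sort of the hits stands in for a proper top-N queue. What it shows is the split of work: the collator is consulted once at startup to order the terms, after which a search only compares integers and resolves them back to terms at the end.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of sorting by pre-computed collation rank.
final class OrdinalSortSketch {
    private final String[] sortedTerms; // distinct terms in collator order
    private final int[] docToRank;      // document ID -> position in sortedTerms

    // docTerms[d] is the sort term of document d (an assumption of this sketch).
    OrdinalSortSketch(String[] docTerms, Collator collator) {
        // Startup cost: the only place where the collator is used.
        sortedTerms = Arrays.stream(docTerms).distinct().toArray(String[]::new);
        Arrays.sort(sortedTerms, collator);

        Map<String, Integer> rankOf = new HashMap<>();
        for (int rank = 0; rank < sortedTerms.length; rank++) {
            rankOf.put(sortedTerms[rank], rank);
        }
        docToRank = new int[docTerms.length];
        for (int doc = 0; doc < docTerms.length; doc++) {
            docToRank[doc] = rankOf.get(docTerms[doc]);
        }
    }

    // Search time: order the hits by integer rank, no collator calls.
    int[] topDocs(int[] hits, int n) {
        Integer[] boxed = Arrays.stream(hits).boxed().toArray(Integer[]::new);
        Arrays.sort(boxed, (a, b) -> Integer.compare(docToRank[a], docToRank[b]));
        return Arrays.stream(boxed).limit(n).mapToInt(Integer::intValue).toArray();
    }

    // Resolve the term after sorting, which is what makes fillFields possible.
    String termFor(int doc) {
        return sortedTerms[docToRank[doc]];
    }

    public static void main(String[] args) {
        String[] docTerms = {"cherry", "apple", "Banana", "apple"};
        OrdinalSortSketch sketch =
            new OrdinalSortSketch(docTerms, Collator.getInstance(Locale.ENGLISH));
        for (int doc : sketch.topDocs(new int[] {0, 1, 2, 3}, 3)) {
            System.out.println(doc + " " + sketch.termFor(doc));
        }
    }
}
```

With the English collator "apple" sorts before "Banana", whereas a plain code-point comparison would put the uppercase "Banana" first; the rank array captures that locale-aware order once so that it does not have to be recomputed for every hit.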

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
