lucene-dev mailing list archives

From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead
Date Tue, 06 Apr 2010 17:19:33 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854069#action_12854069 ]

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

A few experiments with the current implementation:

40GB index, 5 segments, 7.5M documents, 5.5M unique sort terms, 87M terms total, sort locale
da, top-20 displayed with fillFields=true. Just opening the index without any Sort requires
140MB.
Standard sorter: -Xmx1800m, 26 seconds for first search
Exposed sorter: -Xmx350m, 7 minutes for first search (~4½ minutes for segment sorting, ~2½
minutes for merging).

Fully warmed searches, approximate mean:
6.5M hits: standard 2500 ms, exposed 240 ms
4.1M hits: standard 1600 ms, exposed 190 ms
2.1M hits: standard 900 ms, exposed 90 ms
1.2M hits: standard 500 ms, exposed 45 ms
0.5M hits: standard 220 ms, exposed 40 ms
0.1M hits: standard 80 ms, exposed 6 ms
1.7K hits: standard 3 ms, exposed <1 ms

 
2.5GB index, 4 segments, 420K documents, 240K unique sort terms, 11M terms total, sort locale
da, top-20 displayed with fillFields=true. Just opening the index without any Sort requires
18MB.
Standard sorter: -Xmx120m, 2 seconds for first search
Exposed sorter: -Xmx50m, 14 seconds for first search (9 seconds for segment sorting, 5 seconds
for merging).

Fully warmed searches, approximate mean:
420K hits: standard 170 ms, exposed 15 ms
200K hits: standard 85 ms, exposed 9 ms
100K hits: standard 50 ms, exposed 8 ms
 10K hits: standard 6 ms, exposed 0-1 ms

The timings are fairly consistent for this small ad-hoc test: once warmed, the exposed sorter
is roughly 5 to 13 times faster than the standard sorter on both indexes, at the cost of a much
slower first search. The difference between standard and exposed sorting is the time the
collator spends on comparisons. I'll have to test whether this can be improved by using a plain
int array to hold the order of the documents, just as the non-locale String sorter does.
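
To illustrate the int-array idea, here is a minimal sketch, not taken from the patch. It
assumes an array order[docId] with the sort position of each document has already been
built, so that picking the top hits only needs int comparisons and never calls the collator.
The class and method names (IntOrderTopN, topN) are hypothetical.

```java
import java.util.PriorityQueue;

/** Hypothetical illustration; not code from the LUCENE-2369 patch. */
class IntOrderTopN {

    /**
     * Returns the n hits that come first in sort order, best hit first.
     * order[docId] is the precomputed sort position of the document
     * (an assumption of this sketch; building it is where the collator
     * would be used, once, up front).
     */
    static int[] topN(int[] hits, int[] order, int n) {
        if (n <= 0) {
            return new int[0];
        }
        // Max-heap on sort position: the root is the worst hit currently kept.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(order[b], order[a]));
        for (int doc : hits) {
            if (heap.size() < n) {
                heap.add(doc);
            } else if (order[doc] < order[heap.peek()]) {
                heap.poll();   // drop the worst kept hit
                heap.add(doc); // keep the better one
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll(); // worst comes out first, so fill from the back
        }
        return result;
    }
}
```

Each hit costs at most a few int comparisons here, whereas a collator-based comparator would
run collator.compare on two Strings for every one of those comparisons.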

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps
> all sort terms in memory. Besides the huge memory overhead, searching requires comparing
> terms with collator.compare every time, making searches with millions of hits fairly expensive.
> The proposed alternative is to create a packed list of pre-sorted ordinals
> for the sort terms and a map from document IDs to entries in the sorted ordinals list. This
> results in very low memory overhead and faster sorted searches, at the cost of increased startup time.
> As the ordinals can be resolved to terms after the sorting has been performed, this approach
> supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335, which contains
> previous discussions on the subject.
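
The structure described in the issue can be sketched roughly as follows. This is an
illustration under stated assumptions, not the patch itself: it uses plain int arrays instead
of packed ones, keeps the unique terms in a String array instead of reading them from the
index, and assumes exactly one sort term per document. The class SortedOrdinalsSketch and its
members are hypothetical names.

```java
import java.text.Collator;
import java.util.Arrays;

/** Hypothetical illustration; not code from the LUCENE-2369 patch. */
class SortedOrdinalsSketch {
    final String[] terms;       // unique terms; array position = term ordinal
    final int[] sortedOrdinals; // term ordinals arranged in collator order
    final int[] docToEntry;     // docId -> index into sortedOrdinals

    /** docToOrdinal[docId] is the ordinal of the document's sort term. */
    SortedOrdinalsSketch(String[] terms, int[] docToOrdinal, Collator collator) {
        this.terms = terms;

        // The only place the collator is used: sort the ordinals once at startup.
        Integer[] boxed = new Integer[terms.length];
        for (int i = 0; i < boxed.length; i++) {
            boxed[i] = i;
        }
        Arrays.sort(boxed, (a, b) -> collator.compare(terms[a], terms[b]));

        sortedOrdinals = new int[terms.length];
        int[] entryOfOrdinal = new int[terms.length]; // inverse, needed only while building
        for (int entry = 0; entry < boxed.length; entry++) {
            sortedOrdinals[entry] = boxed[entry];
            entryOfOrdinal[boxed[entry]] = entry;
        }

        docToEntry = new int[docToOrdinal.length];
        for (int doc = 0; doc < docToOrdinal.length; doc++) {
            docToEntry[doc] = entryOfOrdinal[docToOrdinal[doc]];
        }
    }

    /** Search-time comparison: two array lookups and an int compare. */
    int compare(int docA, int docB) {
        return Integer.compare(docToEntry[docA], docToEntry[docB]);
    }

    /** Resolves a document back to its term after sorting (the fillFields=true case). */
    String term(int docId) {
        return terms[sortedOrdinals[docToEntry[docId]]];
    }
}
```

In this sketch the Strings are only touched at construction time and when term() resolves the
few displayed hits, which is the trade-off the issue describes: slower startup in exchange for
cheap comparisons at search time.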

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

