Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <743112236.18781270574373473.JavaMail.jira@brutus.apache.org>
Date: Tue, 6 Apr 2010 17:19:33 +0000 (UTC)
From: "Toke Eskildsen (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2369) Locale-based sort by field with low
 memory overhead
In-Reply-To: <142973736.4201270547613607.JavaMail.jira@brutus.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854069#action_12854069 ]

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

A few experiments with the current implementation:

40GB index, 5 segments, 7.5M documents, 5.5M unique sort terms, 87M terms total, sort locale da, top-20 displayed with fillFields=true. Just opening the index without any Sort requires 140MB.
Standard sorter: -Xmx1800m, 26 seconds for first search
Exposed sorter: -Xmx350m, 7 minutes for first search (~4½ minutes for segment sorting, ~2½ minutes for merging).

Fully warmed searches, approximate mean:
6.5M hits: standard 2500 ms, exposed 240 ms
4.1M hits: standard 1600 ms, exposed 190 ms
2.1M hits: standard 900 ms, exposed 90 ms
1.2M hits: standard 500 ms, exposed 45 ms
0.5M hits: standard 220 ms, exposed 40 ms
0.1M hits: standard 80 ms, exposed 6 ms
1.7K hits: standard 3 ms, exposed <1 ms

2.5GB index, 4 segments, 420K documents, 240K unique sort terms, 11M terms total, sort locale da, top-20 displayed with fillFields=true. Just opening the index without any Sort requires 18MB.
Standard sorter: -Xmx120m, 2 seconds for first search
Exposed sorter: -Xmx50m, 14 seconds for first search (9 seconds for segment sorting, 5 seconds for merging).

Fully warmed searches, approximate mean:
420K hits: standard 170 ms, exposed 15 ms
200K hits: standard 85 ms, exposed 9 ms
100K hits: standard 50 ms, exposed 8 ms
 10K hits: standard 6 ms, exposed 0-1 ms

As can be seen, the timings are fairly consistent for this small ad-hoc test. The difference between standard and exposed sorting is the time it takes for the collator to perform compares. I'll have to test if that can be improved by using a plain int-array to hold the order of the documents, just as the non-locale-using String sorter does.
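
A minimal sketch of the plain int-array idea, not the code attached to the issue: it assumes a precomputed array docOrder where docOrder[docID] is the document's position in collator order, so picking the top hits needs only int comparisons and no collator.compare calls. The class name, method name and sample data are hypothetical.

```java
import java.util.PriorityQueue;

final class IntOrderTopN {

    /**
     * Returns up to n hits with the lowest sort order, best first.
     * docOrder[docID] is assumed to be a precomputed position in collator order.
     */
    static int[] topN(int[] hits, int[] docOrder, int n) {
        if (n <= 0) {
            return new int[0];
        }
        // Max-heap on order: the worst of the current best n sits on top.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(docOrder[b], docOrder[a]));
        for (int doc : hits) {
            if (heap.size() < n) {
                heap.add(doc);
            } else if (docOrder[doc] < docOrder[heap.peek()]) {
                heap.poll();      // evict the current worst
                heap.add(doc);
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();   // worst comes out first, so fill from the back
        }
        return result;
    }

    public static void main(String[] args) {
        int[] docOrder = {3, 0, 2, 1, 4};   // hypothetical order per document ID
        int[] hits = {0, 1, 2, 3, 4};       // hypothetical matching document IDs
        for (int doc : topN(hits, docOrder, 2)) {
            System.out.println("doc " + doc + " order " + docOrder[doc]);
        }
    }
}
```

The bounded heap keeps the per-hit cost at one int comparison in the common case, which is the property the int-array approach is meant to test.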

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Besides the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contains previous discussions on the subject.
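
The structure described in the issue can be illustrated roughly as follows. This is a sketch under stated assumptions, not the proposed patch: it uses plain int arrays where the issue speaks of a packed list, assumes exactly one sort term per document, and all class, method and variable names are hypothetical. Only java.text.Collator and the standard library are used; no Lucene API is involved.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class SortedOrdinalsSketch {

    /** result[position] = ordinal of the term that sorts at that position. */
    static int[] sortOrdinals(String[] terms, Collator collator) {
        Integer[] boxed = new Integer[terms.length];
        for (int i = 0; i < boxed.length; i++) {
            boxed[i] = i;
        }
        // The collator is consulted only here, once, at startup.
        Arrays.sort(boxed, (a, b) -> collator.compare(terms[a], terms[b]));
        int[] sorted = new int[boxed.length];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = boxed[i];
        }
        return sorted;
    }

    /** result[docID] = position of the document's term in sortedOrdinals. */
    static int[] mapDocsToPositions(int[] docToTermOrdinal, int[] sortedOrdinals) {
        int[] positionOfOrdinal = new int[sortedOrdinals.length];
        for (int pos = 0; pos < sortedOrdinals.length; pos++) {
            positionOfOrdinal[sortedOrdinals[pos]] = pos;
        }
        int[] docToPosition = new int[docToTermOrdinal.length];
        for (int doc = 0; doc < docToPosition.length; doc++) {
            docToPosition[doc] = positionOfOrdinal[docToTermOrdinal[doc]];
        }
        return docToPosition;
    }

    /** Resolves a document's sort term after sorting (the fillFields=true case). */
    static String resolveTerm(int doc, int[] docToPosition, int[] sortedOrdinals, String[] terms) {
        return terms[sortedOrdinals[docToPosition[doc]]];
    }

    public static void main(String[] args) {
        String[] terms = {"cherry", "apple", "banana"};   // hypothetical terms in index order
        int[] docToTermOrdinal = {2, 0, 1, 0};             // hypothetical: one term per document
        Collator collator = Collator.getInstance(Locale.forLanguageTag("da"));
        int[] sorted = sortOrdinals(terms, collator);
        int[] docToPosition = mapDocsToPositions(docToTermOrdinal, sorted);
        for (int doc = 0; doc < docToPosition.length; doc++) {
            System.out.println("doc " + doc + " -> position " + docToPosition[doc]
                + " (" + resolveTerm(doc, docToPosition, sorted, terms) + ")");
        }
    }
}
```

In this sketch the startup cost is the single collator-driven sort of the ordinals; afterwards two documents are compared by their int positions, and terms are looked up only for the hits that are actually displayed.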
