lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2335) optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into field cache
Date Fri, 26 Mar 2010 09:44:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850055#action_12850055
] 

Toke Eskildsen commented on LUCENE-2335:
----------------------------------------

Just a little status on development.

The SegmentReader has been modified to implement the {{getOrdinalTerms}} stated above and
it works as expected (no surprise, as this is just a refactoring and cleanup of the code from
the proof of concept). I'm moving on to {{DirectoryReader}}. When that is done, I'll make
a preliminary patch. For easy experimentation, I could try and make {{the new Sort(new SortField("myfield",
new Locale("da")));}} use the new code.

Sorting 10M terms still takes about 6 minutes on my machine. I've profiled the code and about
40% of the time is spend on requesting terms (with a cache of 20000 terms, locale da, using
a conventional harddisk). There's room for optimization by smarter caching of terms and of
course using a faster Collator than Java's default. Using simple {{String.compareTo}}, which
leads to nearly sequential access to terms, takes 44 seconds for the 10M terms. The key to
real improvement seems to be a better exploitation of the observation that "Unicode order
is fairly aligned to most Locale-based orders". It is already working somewhat as dumb in-memory
merge-sort of the 10M Strings takes about 10 minutes. I'll save that for later though.

Now, for merges there is some decisions to make. The primary one being how to handle terms
that are the same for different segments. For the first take, I'll avoid merging such terms
and thus allow the iterator to deliver the same term more than once for {{DirectoryReader}}s,
but with different ordinals. This should work fine for direct building of a docID->sortedOrdinalIndex
map, usable for sorting, which is the primary use-case.

If the structure is to be used for faceting, the terms instances needs to be collapsed. Luckily
the logic for doing so resides at the code that used the iterator, rather than internally
in the readers. Using Exposed for memory-efficient faceting should still be feasible with
the general code. If the order of the tags in the facet is not significant, the Unicode-order-comparator
can be used, making it very fast (1-2 minutes / 10M unique terms) to build the structure for
a given field. ...But I digress and will try and get back on track with the real code.

> optimization: when sorting by field, if index has one segment and field values are not
needed, do not load String[] into field cache
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2335
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>
>
> Spinoff from java-dev thread "Sorting with little memory: A suggestion", started by Toke
Eskildsen.
> When sorting by SortField.STRING we currently ask FieldCache for a StringIndex on that
field.
> This can consumes tons of RAM, when the values are mostly unique (eg a title field),
as it populates both int[] ords as well as String[] values.
> But, if the index is only one segment, and the search sets fillFields=false, we don't
need the String[] values, just the int[] ords.  If the app needs to show the fields it can
pull them (for the 1 page) from stored fields.
> This can be a potent optimization -- alot of RAM saved -- for optimized indexes.
> When fixing this we must take care to share the int[] ords if some queries do fillFields=true
and some =false... ie, FieldCache will be called twice and it should share the int[] ords
across those invocations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message