lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Toke Eskildsen (JIRA)" <>
Subject [jira] Updated: (LUCENE-2335) optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into field cache
Date Tue, 06 Apr 2010 11:04:34 GMT


Toke Eskildsen updated LUCENE-2335:

    Comment: was deleted

(was: Lotsa devils in the details when you're poking around in the belly of Lucene, but modulo
some business with deleted documents, it looks fine for simple (no parallel and multi readers)
usage. fillFields=true works just as it should by delaying the actual term resolving until
the documents ID's are determined. The current code makes it possible to create an exposed
Sort quite easily:
      ExposedFieldComparatorSource exposedFCS =
          new ExposedFieldComparatorSource(reader, new Locale("da"));
      Sort sort = new Sort(new SortField("mySortField", exposedFCS));

For the curious, a modified Lucene-JAR can be downloaded at
and tested with
java -cp lucene-core-3.1-dev-LUCENE-2335-20100405.jar org.apache.lucene.index.ExposedPOC expose
<index> <sortField> <locale> <defaultField>
this will present the user with a simple shell where searches can be performed. Heap-usage
and execution times are displayed along with the search result.

I did a little bit of real world experimenting: A 2.5GB index with 400K documents with 320K
unique sort terms took 14 seconds to open. After that, a locale-based sorted search that hit
250K documents and returned the first 50 took 30ms (fully warmed by re-searching 5 times).
Max heap was specified to 40MB of which 20MB was used after the building of the sort structure
was finished.

The same search using the standard locale-oriented sorter took about 1 second to start up,
After that, the 250K search took 130ms, fully warmed. Max heap was specified to 100MB.

The default sorter was able to get by with 80MB, but execution-time increased drastically
to 2000ms. Probably because of the GC-overhead that the Collator introduces by temporarily
allocating two new objects for each comparison.

The bad news is that this is quite a bit of code (400+ extra lines for the SegmentReader alone)
with several levels of indirection in the data structures. As an example, getting the actual
term for a given docID in the ExposedFieldComparatorSource is done with
        final long resolvedDocOrder = docOrder.get(order[slot]);
        return resolvedDocOrder == undefinedTerm ? null : reader.getTermText(
which is not easily digested without a very thorough explanation, preferably with a diagram.

The API-changes to the IndexReaders is the addition of two methods:
String getTermText(int ordinal) throws IOException;
is self-explanatory, but 
ExposedIterator getExposedTuples(
      String persistenceKey, Comparator<Object> comparator, String field,
      boolean collectDocIDs) throws IOException;
is real hard to do when writing a new IndexReader. My current approach is to use an interface
ExposedReader with the methods and let the updated IndexReaders implement that, thereby making
it optional for IndexReaders to be Exposed. 

LUCENE-2335 seems to fulfill the promises so far, with the previously discussed trade-offs:
* Long startup (~1 min/1M documents, less on re-open)
* Fast locale-based sorted search (supports fillFields=true and has near-zero GC overhead)
* Very low memory overhead (both permanent and for initializing)

Regarding making a proper patch, I would like to know what I should patch against. I use LUCENE-1990
so I need to do it against a fairly new version. I can see that Flex is about to be merged,
so I guess it would make sense to wait for that one.)

> optimization: when sorting by field, if index has one segment and field values are not
needed, do not load String[] into field cache
> ------------------------------------------------------------------------------------------------------------------------------------
>                 Key: LUCENE-2335
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
> Spinoff from java-dev thread "Sorting with little memory: A suggestion", started by Toke
> When sorting by SortField.STRING we currently ask FieldCache for a StringIndex on that
> This can consumes tons of RAM, when the values are mostly unique (eg a title field),
as it populates both int[] ords as well as String[] values.
> But, if the index is only one segment, and the search sets fillFields=false, we don't
need the String[] values, just the int[] ords.  If the app needs to show the fields it can
pull them (for the 1 page) from stored fields.
> This can be a potent optimization -- alot of RAM saved -- for optimized indexes.
> When fixing this we must take care to share the int[] ords if some queries do fillFields=true
and some =false... ie, FieldCache will be called twice and it should share the int[] ords
across those invocations.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message