lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Order of fields within a Document in Lucene 2.4+
Date Wed, 01 Jul 2009 09:22:56 GMT
Sorry, yes, this was my fault with the indexing speedups in 2.3
(LUCENE-843): as of 2.3, if any fields have term vectors enabled, the
fields are sorted lexicographically.  As of 2.4 (LUCENE-1301,
refactoring the indexing core), that sort happens even without term
vectors.
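
To illustrate why the sort defeats LOAD_AND_BREAK, here is a minimal
self-contained sketch in plain Java. It does not use Lucene's classes: it
only models a reader that visits stored fields in order and stops at the
last field it needs, which is roughly what a FieldSelector returning
LOAD_AND_BREAK does. The field names and the FieldOrderSketch class are
hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FieldOrderSketch {

    /**
     * Counts how many stored fields are visited before every wanted field
     * has been seen. This models (it is not Lucene code) a selector that
     * breaks out as soon as the last field it needs has been loaded.
     */
    public static int fieldsVisited(List<String> storedOrder, Set<String> wanted) {
        Set<String> remaining = new HashSet<>(wanted);
        int visited = 0;
        for (String name : storedOrder) {
            visited++;
            remaining.remove(name);
            if (remaining.isEmpty()) {
                break; // the point where LOAD_AND_BREAK would stop the scan
            }
        }
        return visited;
    }

    /** Returns a lexicographically sorted copy, mimicking the sort done at indexing time. */
    public static List<String> sortedCopy(List<String> fields) {
        List<String> copy = new ArrayList<>(fields);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical document: the four needed fields are added first.
        List<String> added = Arrays.asList(
                "title", "url", "summary", "id", "body", "author", "category");
        Set<String> wanted = new HashSet<>(
                Arrays.asList("title", "url", "summary", "id"));

        System.out.println("addition order: " + fieldsVisited(added, wanted));
        System.out.println("sorted order:   " + fieldsVisited(sortedCopy(added), wanted));
    }
}
```

With fields kept in addition order the scan stops after the 4 needed
fields; once the fields are sorted, "url" lands last and all 7 are visited,
so the early break saves nothing.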

Hoss, I see you've opened an issue for this (LUCENE-1727);
I'll take that & fix it for 2.9.

Sorry,

Mike

On Tue, Jun 30, 2009 at 9:20 PM, Mark Miller<markrmiller@gmail.com> wrote:
> Yeah, I've heard rumblings about this issue before. I can't remember what
> patch changed it though - one of Mike M's I think?
>
> On Tue, Jun 30, 2009 at 8:40 PM, Chris Hostetter
> <hossman_lucene@fucit.org>wrote:
>
>>
>> Hmmm... i'm not an expert on the internals of indexing, and i don't use
>> FieldSelectors much, but this seems like a pretty big bug to me ... or at
>> the very least: a change in behavior that completely eliminates the value
>> of LOAD_AND_BREAK.
>>
>> https://issues.apache.org/jira/browse/LUCENE-1727
>>
>>
>>
>> : The Lucene FAQ says...
>> :
>> : What is the order of fields returned by Document.fields()?
>> : * Fields are returned in the same order they were added to the document.
>> : (now getFields() as fields is deprecated)
>> :
>> : However I think this may no longer be the case in 2.4
>> :
>> : We are indexing documents in a specific order so that we can
>> LOAD_AND_BREAK out of our FieldSelector as early as possible.
>> : i.e. we have typically 50 indexed fields for a document, but when we are
>> loading results with .doc(), we know we only need 4 of them.
>> :
>> : So, our code ensures that these are added to the index first - and once
>> the 4th field is loaded we break out of the selector.
>> :
>> : This speeds us up by an order of magnitude.
>> :
>> :
>> :
>> : However, we are finding that our field selector is processing fields in
>> alphabetical order, not order of addition.  This means that we'd have to
>> rename our fields to 'aaa..' in order to guarantee they'd be processed
>> first.
>> :
>> :
>> : I think, but am not sure, that this bit of code causes the problem (as
>> spotted in
>> http://www.mail-archive.com/java-user@lucene.apache.org/msg24105.html).
>> : It seems to have been introduced in version 2.4 (fields are in addition
>> order in 2.3.2)
>> :
>> : DocFieldProcessorPerThread.java:
>> :
>> :    // If we are writing vectors then we must visit
>> :    // fields in sorted order so they are written in
>> :    // sorted order.  TODO: we actually only need to
>> :    // sort the subset of fields that have vectors
>> :    // enabled; we could save [small amount of] CPU
>> :    // here.
>> :    quickSort(fields, 0, fieldCount-1);
>> :
>> :
>> : This appears to sort fields into alphabetical order.
>> :
>> : Assuming that implementing the TODO would keep them in order of addition
>> (and just keep vectors fields themselves sorted) - is it worth raising a
>> JIRA to fix this ?
>> :
>> :
>> : regards,
>> :
>> : matt
>> :
>> :
>> :
>> :
>>
>>
>>
>> -Hoss
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>


