lucene-dev mailing list archives

From: Michael McCandless
Subject: Re: Empty Sink Tokenizer
Date: Wed, 01 Apr 2009 16:07:13 GMT

On Wed, Apr 1, 2009 at 10:28 AM, Grant Ingersoll wrote:
> On Mar 31, 2009, at 2:38 PM, Michael McCandless wrote:
>> There are two separate things, here.
>> First is that indexed fields are now processed in alpha order
>> (stable/partial sort for multivalued fields), as of 2.3.  That I think
>> is something internal to Lucene and I'm not sure we should make
>> promises one way or another in what order Lucene visits the fields on
>> a document (maybe someday multiple threads will run on fields... who
>> knows).
>> This I think was the original problem reported on java-user, because
>> Lucene tried to pull tokens from the sink before it was filled.
>> I'm not sure how best to fix Sink/TeeTokenizer to "be flexible".
>> Maybe we could change it so that whichever TokenStream is pulled
>> first, it then pulls from the true source and populates the other as a
>> sink.  Then when the other is used, it's always populated.
> Interesting, that could work, but it would require some work to achieve.  Of
> course, the workaround is to just document the collation factor and then
> people can rename fields, I guess.  I'll try to find some spare time to play
> around with this.  Personally, the Sink/Tee stuff could be used in more
> places than it is for things like copy fields and extraction problems.

I'm not sure we should document it, since that implies we won't change it.

I think on the lists we can say "this is how it currently works, but it
could change in any release", i.e., you're using an "undocumented
feature" when you rely on the order in which Lucene processes the fields.
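
To make the current (unpromised) behavior concrete, here is a minimal
sketch in plain Java. It is not Lucene code: the Field class and
processingOrder method are hypothetical names invented for this
illustration. It only mimics "alpha order with a stable sort", so that
multiple values of the same field keep the order in which they were added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
class FieldOrderDemo {
    /** Stand-in for a document field: just a name and a value. */
    static final class Field {
        final String name;
        final String value;
        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Returns "name=value" strings in the order the fields would be visited. */
    static List<String> processingOrder(List<Field> fields) {
        List<Field> sorted = new ArrayList<>(fields);
        // Collections.sort is guaranteed stable: fields with equal names
        // keep their original relative order.
        Collections.sort(sorted, Comparator.comparing((Field f) -> f.name));
        List<String> out = new ArrayList<>();
        for (Field f : sorted) {
            out.add(f.name + "=" + f.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> doc = Arrays.asList(
            new Field("title", "t"),
            new Field("body", "first"),
            new Field("author", "a"),
            new Field("body", "second"));
        System.out.println(processingOrder(doc));
    }
}
```

Under this ordering, a sink field whose name sorts before the field that
feeds it would be visited first, which matches the empty-sink symptom
described above.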

>> So a TeeTokenFilter would take a single source and export two copies
>> (TokenFilters) and you'd use those copies in your fields.
>> The second issue is that when you store fields in a document and
>> retrieve the document later, the fields have been sorted by name.  I
>> think this is basically another case of "the document you provided
>> during indexing isn't the same thing as what you retrieve at search
>> time" (as Yonik said).
>> I'm also not sure we should promise it (though, reverting to the
>> pre-2.3 approach is possible, and easier than changing order that we
>> invert fields), or maybe we should wait until we work out input vs
>> output documents.  And, apparently not many people have noticed this
>> 2nd issue...
> Agreed.  I think 3.0 warrants some rework of Documents as Hoss, Yonik and
> others have suggested, but are we then supposed to have it figured out by
> 2.9 in order to deprecate the existing approach?

(Yes, we'd need to have the new way working, and the old way deprecated, in 2.9.)

I too would love to see this done in time for 2.9... Any volunteers out there?
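
For anyone picking this up, here is a rough sketch of the "whichever
stream is pulled first pulls from the true source" idea quoted above. It
is plain Java and deliberately does not use Lucene's TokenStream,
TeeTokenFilter, or SinkTokenizer classes; SharedTee and newCopy are
hypothetical names, and tokens are modeled as simple Strings. It is also
not thread-safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch only; not Lucene's Tee/Sink API.
class SharedTee {
    private final Iterator<String> source;              // the true token source
    private final List<String> cache = new ArrayList<>(); // tokens pulled so far

    SharedTee(Iterator<String> source) {
        this.source = source;
    }

    /** Returns a new copy of the stream; copies may be consumed in any order. */
    Iterator<String> newCopy() {
        return new Iterator<String>() {
            private int pos = 0; // this copy's position within the shared cache

            @Override
            public boolean hasNext() {
                return pos < cache.size() || source.hasNext();
            }

            @Override
            public String next() {
                if (pos == cache.size()) {
                    // This copy is ahead of the cache, so it pulls from the
                    // true source and records the token for the other copies.
                    cache.add(source.next());
                }
                return cache.get(pos++);
            }
        };
    }

    public static void main(String[] args) {
        SharedTee tee = new SharedTee(Arrays.asList("quick", "brown", "fox").iterator());
        Iterator<String> fieldA = tee.newCopy();
        Iterator<String> fieldB = tee.newCopy();
        // Consume fieldB first, as if its field name sorted earlier.
        while (fieldB.hasNext()) {
            System.out.println("B: " + fieldB.next());
        }
        while (fieldA.hasNext()) {
            System.out.println("A: " + fieldA.next());
        }
    }
}
```

With this shape neither copy is "the sink": each sees every token no
matter which field is visited first, at the cost of caching the tokens
until all copies have consumed them.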

