lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: who clears attributes?
Date Wed, 12 Aug 2009 07:14:33 GMT

> +1. We don't use Solr, but have quite a bunch of medium and
> short-sized documents. Plus heaps of metadata fields.
>
> I'm yet to read Uwe's example, but I feel I'm a bit misunderstood by
>    

Did you read it yet? What do you think about it?

> some of you. My gripe with new API is not that it brings us troubles
> (which are solved one way or another), it is that the switch and
> associated migration costs bring zero benefits in immediate and remote
> future.
> The only person that tried to disprove this claim is Uwe. Others
> either say "the problems are solved, so it's okay to move to the new
> API", or "this will be usable when flexindexing arrives". Sorry, the
> last phrase doesn't hold its place, this API is orthogonal to
> flexindexing, or at least nobody has shown the opposite.
>    

Whether the API is orthogonal to flexible indexing depends on how you 
define "flexible indexing". I admit the term is vague and probably not 
clearly defined anywhere.

I agree that if flexible indexing only means changing the encoding, 
i.e. *how* data is stored, e.g. PFOR vs. the current posting format, 
then yes, we don't need the new TokenStream API for it.

But the goals we have with flexible indexing go beyond that. We want 
to allow customizing *what* data is stored in the inverted index. The 
very first discussion about flexible indexing, which happened several 
years ago, can be found in the wiki: 
http://wiki.apache.org/lucene-java/FlexibleIndexing

Even this very early proposal suggested starting with the following 
posting formats:
a. <doc>+
b. <doc, boost>+
c. <doc, freq, <position>+ >+
d. <doc, freq, <position, boost>+ >+

For d. you need to change the TokenStream API. How else can we get the 
boost from the source to the indexer? Of course, you can always serialize 
the additional data into the payload byte array, but if filters want to 
do something with it, performance suffers. The new API solves this 
problem very nicely. When we open up the posting format like this, people 
will want to store different custom things in there. The new TokenStream 
API is prepared for that - the old one isn't.
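
To make the payload point concrete, here is a minimal sketch in plain 
Java. It deliberately does not use Lucene's classes: BoostSketch, 
BoostAttribute and the halve* helpers are hypothetical names made up for 
this illustration, and the 4-byte float encoding is just an assumption 
about how one might pack a boost into a payload. It only shows the 
difference between a filter touching a serialized boost and a filter 
touching a typed per-token value.

```java
import java.nio.ByteBuffer;

public class BoostSketch {

    // Hypothetical stand-in for a typed per-token attribute.
    // This is NOT Lucene's actual attribute class.
    static final class BoostAttribute {
        float boost = 1.0f;
    }

    // Payload approach: the boost is serialized into an opaque byte array.
    static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    static float decodeBoost(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    // A "filter" that halves the boost when it lives in a payload:
    // it has to decode, modify and re-encode for every token.
    static byte[] halveViaPayload(byte[] payload) {
        return encodeBoost(decodeBoost(payload) * 0.5f);
    }

    // The same "filter" when the boost is a typed value:
    // a single field update, with no byte array allocated.
    static void halveViaAttribute(BoostAttribute att) {
        att.boost *= 0.5f;
    }

    public static void main(String[] args) {
        byte[] payload = halveViaPayload(encodeBoost(2.0f));
        System.out.println("payload boost:   " + decodeBoost(payload));

        BoostAttribute att = new BoostAttribute();
        att.boost = 2.0f;
        halveViaAttribute(att);
        System.out.println("attribute boost: " + att.boost);
    }
}
```

Both paths end up with the same boost. The difference is only in the 
per-token work each filter in the chain has to do to get there.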

  Michael

> So, what I'm arguing against is adding some code (and forcing users to
> migrate) just because we can, with no other reasons.
>
>    



