lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Baby steps towards making Lucene's scoring more flexible...
Date Tue, 09 Mar 2010 18:18:12 GMT
On Tue, Mar 9, 2010 at 10:03 AM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote:
>> > For what it's worth, that's sort of the way KS used to work: Schema/FieldType
>> > information was stored entirely in source code.  That's changed and now we
>> > serialize the whole schema including all Analyzers, but source-code-only is a
>> > viable approach.
>>
>> Hmm but KS still somehow enforced strong typing across indexing
>> sessions?
>
> Nope, it wasn't enforced.

OK.

>> You said "of course" before but... how in your proposal could one
>> store all stats for a given field during indexing, but then sometimes
>> use match-only and sometimes full-scoring when querying against that
>> field?
>
> The same way that Lucene knows that sometimes it needs a docs-only-enum and
> sometimes it needs a docs-and-positions enum.  Sometimes you need scores,
> sometimes you don't.

But if user had specified BM25Sim when indexing... can they later just
change that to MatchOnlySim at search time?

>> >> If user switches up their codec then they'll need to ensure it also
>> >> stores stats required by their Sim(s).
>> >
>> > That's backwards, IMO.
>>
>> I'm still baffled.  If I wanna play a movie on my 1080P monitor I'll
>> need to find a movie that was encoded hidef (ie, bluray not dvd).
>>
>> I mean, I don't have to.  DVD content will play fine still... just
>> degraded quality.
>
> Heh.  Consumers hate format wars....

True :)

> In this case, though, we're dealing with software, not DVD hardware, so
> upgrading is a lot easier.  Under the format-follows-Similarity model, the
> relationship between Similarity and posting format is more akin to the
> relationship between a container format like Quicktime and codecs like
> Sorenson 3 or H.264.

I like the first analogy better ;)  Sim (defines how to score docs
against a query) seems mighty important (moreso than a container
format).

> Tweakers will want to go in and monkey with the choice of codec within the
> Quicktime file, but most users will just trust us to use the latest and
> greatest.

True, but the defaults will be good in Lucene, too.

>> > The posting format encoding should be an implementation detail.  The general
>> > user should be expressing their intent as far as how they want the field to be
>> > scored, and the posting format should flow from that.
>>
>> Maybe it's that it bothers you that with this proposed change the
>> user makes 2 decisions -- Codec and Sim?
>
> Yes, and it bothers me that users have to know about codecs at all, when in
> the vast majority of cases it doesn't matter because the default is going to
> be the best choice.
>
> Since compression algorithm performance depends on knowing how to exploit
> patterns in the data and sometimes the user will know about patterns that are
> opaque to us, in some circumstances they will be able to select a more
> appropriate codec.  But that's not the common case, as it requires both
> unusual data and an unusually sophisticated user.
>
> What users will be able to tell us is how they want the field to be used, and
> we can use that information to help us optimize.  For example, when a user
> declares that they want a field to be "match-only", we know we don't have to
> write boost bytes, freq or positions, saving space.

Yeah.... so, I don't like that in Lucene you call "Field.setOmitTFAP"
instead of saying "Field.matchOnly" (or something).  So I do agree
that it'd be better if the API made it clear what the *search* time
impact is of using this advanced Field API.

We get users who are baffled that their phrase queries no longer work
after setting omitTFAP.  (Today it silently returns no results... with
flex you'll get an exception).  Hmm...

>> Ie user will choose PFor or Standard or Pulsing(PFor/Standard) codec, and
>> then separately choose Sim?
>>
>> But these are important choices.  They should be separate.  Why
>> force-bundle them?
>
> Because most of the time the user isn't going to be able to improve on the
> default.

Right but with either approach we'll set good defaults.  Doesn't seem
differentiating here...

>> > Whether we use VInt, PFOR, group varint, hand-tuned bit shifting, etc under
>> > the hood to implement BM25, match-only, boost-per-position or whatever
>> > shouldn't be the user's concern.  As time goes on, we should allow ourselves
>> > the flexibility to use new compression techniques to write new segments.
>>
>> But w/ the proposed change Lucene users will be free to use better
>> codecs?
>
> They could use better codecs under the format-follows-Similarity model, too.
> They'd just have to subclass and override the factory methods that spawn
> posting encoders/decoders.

Ahh, OK so that's how they'd do it.

So... I think we're making a mountain out of a molehill.

In format-follows-Sim, it sounds like that simply means the Sim has a
default codec, but you can override it if you want (and it's the Sim
that "owns" (has the method for) handing out the Codec you'll use).

Whereas in Lucene the same defaulting will take place.  It's just that
Sim won't "own" picking the Codec.
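
A rough sketch of the two arrangements, assuming made-up names throughout
(PostingCodec, OwningSim, TunedSim, FieldConfig are illustrative only and
are neither Lucene nor Lucy/KS classes). In the first, the Sim owns a
factory method that hands out the codec and a subclass overrides it; in the
second, Sim and codec are separate settings that each have a default.

```java
// All names here are hypothetical; none are Lucene or Lucy/KS APIs.
interface PostingCodec {
    String name();
}

final class VIntCodec implements PostingCodec {
    public String name() { return "VInt"; }
}

final class PForCodec implements PostingCodec {
    public String name() { return "PFor"; }
}

// Model A, format-follows-Sim: the Sim owns the factory method for the codec.
class OwningSim {
    PostingCodec makeCodec() {
        return new VIntCodec(); // library-chosen default
    }
}

// A "tweaker" subclasses the Sim and overrides the factory method.
class TunedSim extends OwningSim {
    @Override
    PostingCodec makeCodec() {
        return new PForCodec();
    }
}

// Model B, separate choices: Sim and codec are independent settings,
// each with its own default.
class PlainSim { }

final class FieldConfig {
    final PlainSim sim;
    final PostingCodec codec;

    FieldConfig() {
        this(new PlainSim(), new VIntCodec()); // defaults for both
    }

    FieldConfig(PlainSim sim, PostingCodec codec) {
        this.sim = sim;
        this.codec = codec;
    }
}

final class SimCodecSketch {
    public static void main(String[] args) {
        System.out.println(new OwningSim().makeCodec().name());
        System.out.println(new TunedSim().makeCodec().name());
        System.out.println(new FieldConfig(new PlainSim(), new PForCodec()).codec.name());
    }
}
```

Either way, a user who never touches the codec gets the default; the only
difference in this sketch is which object the override hangs off.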

>> Are you worried about proper defaulting?  We'll handle that
>> (under Version).
>
> I don't think it's necessary or desirable to handle this with Version.  A
> codec improvement (say, encoding match-only fields using PFOR instead of
> VInts) would simply trigger an index format number increment, and new segments
> would be written using the latest format.

You're right -- we don't need Version here (assuming there's no
back-compat break).

>> > There's no difference between calling enum.nextPosition() and
>> > positions.next(), is there?
>>
>> Right now it's a 2 step process when you access via attr -- first you
>> ask the enum to next(), then you ask each attr associated w/ that enum
>> for their value.
>
> OK, I think I see where the limitation arises.
>
> In Lucy/KS, we'd just access the positions value as a member variable (direct
> struct access) rather than invoking a method.  By default, struct definitions
> are opaque and thus member vars are inaccessible (to encourage loose
> coupling), but we override that in certain cases for performance.
>
> However, direct struct access requires a direct inheritance guarantee, while
> "attributes" in Lucene only guarantee interface compliance.

OK
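
For illustration, a small sketch of the two access styles discussed above,
using invented names (PositionAttr, SketchEnum) rather than Lucene's
attribute classes. The attribute path is the two-step form: advance the
enum, then ask the attribute for its value through an interface. The direct
path reads a field on the concrete class, which only works when the caller
knows the concrete type.

```java
// Hypothetical names; this is not Lucene's attribute API.
interface PositionAttr {
    int position();
}

final class SketchEnum {
    private final int[] positions;
    private int idx = -1;

    // Direct access: a plain field, readable only by code that knows
    // it is holding a SketchEnum (the "direct inheritance guarantee").
    int currentPosition;

    // Attribute access: callers see only the interface.
    private final PositionAttr attr;

    SketchEnum(int... positions) {
        this.positions = positions.clone();
        this.attr = () -> this.positions[idx];
    }

    // Step 1: advance the enum.
    boolean next() {
        if (idx + 1 >= positions.length) {
            return false;
        }
        idx++;
        currentPosition = positions[idx];
        return true;
    }

    // Step 2 (attribute style): fetch the attribute, then ask it for the value.
    PositionAttr positionAttr() {
        return attr;
    }
}

final class AttrAccessSketch {
    public static void main(String[] args) {
        SketchEnum e = new SketchEnum(3, 7, 12);
        PositionAttr attr = e.positionAttr();
        while (e.next()) {
            // Both styles report the same value after each next().
            System.out.println(attr.position() + " " + e.currentPosition);
        }
    }
}
```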

> You don't want to use the stronger, more constrictive check, right?

You mean single inheritance?  No.  Because then we hardwire the attrs
to the Codec.  Standard codec should encode whatever attrs the app
hands us... I think.

Mike


