incubator-lucy-dev mailing list archives

From Michael McCandless <>
Subject Re: index-time vs search-time Document
Date Mon, 13 Apr 2009 13:55:52 GMT
On Sun, Apr 12, 2009 at 3:40 PM, Marvin Humphrey <> wrote:
> On Sun, Apr 12, 2009 at 02:36:13PM -0400, Michael McCandless wrote:
>> > At search time, the default doc reader returns a HitDoc object, which
>> > subclasses Doc and differs in the following ways:
>> >
>> >  * The HitDoc constructor takes a float "score".
>> When you do field-sorted top N collection do you also store the field
>> values in each HitDoc?
> Hmm, not sure I grok that.  The end result of field-sorted top-N collection is
> a TopDocs object with FieldDoc members.  HitDoc doesn't come into play until
> later, when you call IxReader_Fetch_Doc() for each FieldDoc's doc num.

In Lucene, as a side effect of sorting by field value, we also return
to you the actual values of those fields for each hit.
(TopDocs.scoreDocs is an array of FieldDoc).  We didn't have to do
this; one could go and retrieve those values the "normal" way.  So I
was wondering if KS/Lucy does this (sounds like yes).

In KS/Lucy when you get the TopDocs back, you still must call
IxReader_Fetch_Doc to retrieve the HitDoc: what do you pass that
function?  (Can't just be docID, since you fold score into it).

> If that was a braino and you meant "do you also store the score", even when it
> wasn't the determining element in the sort order, then yes.
> Maybe we ought to be setting the score to NaN if it was truly irrelevant or
> never even calculated.

We've splintered out all the possibilities in Lucene now... on quick
testing we seem to save a lot of CPU by not computing the score per hit if
it doesn't participate in sort (or isn't needed, for maxScore).  We
can do even better, by pushing the "bottomValue" comparison down to
individual TermScorers (like what we do w/ deletes and should do w/
random-access filters).
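
A minimal sketch of the "skip the score when it isn't needed" idea, combined
with the NaN suggestion quoted above. Everything here is hypothetical and for
illustration only: SketchHit, SketchHit_collect, SketchHit_has_score, and
Sketch_counting_score are made-up names, not Lucene or KS/Lucy API.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hit record; score is NAN when it was never computed. */
typedef struct {
    int   doc_num;
    float score;
} SketchHit;

/* Hypothetical scoring callback type. */
typedef float (*SketchScoreFn)(int doc_num);

/* Counts calls so a caller can confirm that scoring was skipped. */
static int sketch_score_calls = 0;

/* Stand-in scorer: returns a constant and records that it ran. */
static float
Sketch_counting_score(int doc_num)
{
    (void)doc_num;
    sketch_score_calls++;
    return 1.5f;
}

/* Collect one hit, invoking the scorer only if the sort needs the score. */
static SketchHit
SketchHit_collect(int doc_num, bool needs_score, SketchScoreFn score_fn)
{
    SketchHit hit;
    assert(!needs_score || score_fn != NULL);
    hit.doc_num = doc_num;
    hit.score   = needs_score ? score_fn(doc_num) : NAN;
    return hit;
}

/* True if a score was actually computed for this hit. */
static bool
SketchHit_has_score(const SketchHit *hit)
{
    return !isnan(hit->score);
}
```

With needs_score false the scorer is never called, and a consumer can tell
"no score" apart from a real score of 0.0 by checking SketchHit_has_score().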

>> >  * HitDoc provides the novel methods Set_Score and Get_Score.
>> Why Set_Score?  Isn't it fixed once returned?
> Actually, if anything, the "score" argument to HitDoc's constructor should be
> eliminated.
> IndexReader's Fetch_Doc() method doesn't convey a score, so its doc_reader
> member can't supply it at construction time.  It has to be set later, using
> Set_Score().

Ahhh got it.  I still don't understand how Searchable_Fetch_Doc knows
it's supposed to return a HitDoc vs a normal Doc?  Oh maybe it simply
always returns a HitDoc?  Hmm.

> Here's the code from Hits_Next() which calls Set_Score():
>    Obj*
>    Hits_next(Hits *self)
>    {
>        ScoreDoc *score_doc
>            = (ScoreDoc*)VA_Fetch(self->top_docs->score_docs, self->offset);
>        self->offset++;
>        if (!score_doc) {
>            /** Bail if there aren't any more *captured* hits. (There may
>             * be more total hits.) */
>            return NULL;
>        }
>        else {
>            /* Lazily fetch HitDoc, set score. */
>            Obj *doc = Searchable_Fetch_Doc(self->searchable,
>                score_doc->doc_num);
>            if (OBJ_IS_A(doc, HITDOC)) {
>                HitDoc *hit_doc = (HitDoc*)doc;
>                HitDoc_Set_Score(hit_doc, score_doc->score);
>            }
>            return doc;
>        }
>    }
> I had to add that OBJ_IS_A test when Searchable_Fetch_Doc() switched over to
> returning an Obj rather than a HitDoc.

Hmm -- what does that mean if one overrides what's returned from
Searchable_Fetch_Doc, and one also wants score baked into it?

>> This means when Lucy core needs to get the fields out of the docs (eg
>> to index their values) it also relies on host code ("get" from its
>> hashtable) to get the value?
> Actually, what it does at index-time is wrap each host string with a
> ViewCharBuf, which then allows most of the core to interact with the value
> using the host-agnostic API.
> ViewCharBuf is a CharBuf that doesn't own its string.  It's a bit of a
> dangerous class, because the caller has to take responsibility for not changing
> the string out from underneath the ViewCharBuf and causing a memory error.

Ahh so each host wrapper must provide a ViewCharBuf for efficient
access to a hosty string w/o copying the string.
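
A rough sketch of what a non-owning string wrapper can look like in C, under
the assumption that it only borrows the host's buffer. StrView and its
functions are hypothetical names for illustration; this is not Lucy's actual
ViewCharBuf implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view over a host-owned string. */
typedef struct {
    const char *ptr;  /* borrowed from the host; never freed here */
    size_t      len;  /* length in bytes */
} StrView;

/* Wrap a host buffer without copying it.  The caller must keep the
 * buffer alive and unmodified for as long as the view is in use. */
static StrView
StrView_wrap(const char *host_str, size_t len)
{
    StrView view;
    assert(host_str != NULL || len == 0);
    view.ptr = host_str;
    view.len = len;
    return view;
}

/* Compare the viewed bytes against a NUL-terminated C string. */
static int
StrView_equals_str(const StrView *view, const char *other)
{
    size_t other_len = strlen(other);
    if (view->len != other_len) { return 0; }
    return other_len == 0 || memcmp(view->ptr, other, other_len) == 0;
}
```

The view has no destructor on purpose: freeing or mutating the host string
while a view still points at it is exactly the memory error described above.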

> At some point in the future, we need to go over how Inverter and its
> Invert_Doc() method work.

That sounds like fun ;)

>> So basically KS/Lucy simply subclasses the index-time document to make
>> the HitDoc.  Lucene probably cannot take the same approach since we
>> change the "type" of your fields (when FieldInfos are merged) which
>> of course causes confusion.
> Well, you could still return a HitDocument which subclassed Document -- you
> just couldn't eliminate the confusion.

Right, that's the weak typing wreaking havoc.

>> Unless we can improve that... eg I think a "write once" schema could maybe
>> work.
> Sounds good to me, of course.  :)  But I imagine there will be resistance.
> It's ironic how dearly certain Java folks treasure the "flexibility" of this
> API, given that Java's such a straightjacket.

Yes I find it interesting and unusual, too.  But I would think as long
as we keep UntypedFieldType for such uber-flexibility, t'sall good.

> FWIW, I think the only use case that's unachievable under global field
> semantics is the "slowly evolving type change".  Everything else is doable
> with varying degrees of elegance.

I think that's what I'm calling "weak typing".

