incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: index-time vs search-time Document
Date Sun, 12 Apr 2009 19:40:18 GMT
On Sun, Apr 12, 2009 at 02:36:13PM -0400, Michael McCandless wrote:
> > At search time, the default doc reader returns a HitDoc object, which
> > subclasses Doc and differs in the following ways:
> >
> >  * The HitDoc constructor takes a float "score".
> When you do field-sorted top N collection do you also store the field
> values in each HitDoc?

Hmm, not sure I grok that.  The end result of field-sorted top-N collection
is a TopDocs object with FieldDoc members.  HitDoc doesn't come into
play until later, when you call IxReader_Fetch_Doc() for each FieldDoc's doc

If that was a braino and you meant "do you also store the score", even when it
wasn't the determining element in the sort order, then yes.

Maybe we ought to be setting the score to NaN if it was truly irrelevant or
never even calculated.
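
A minimal sketch of the NaN idea, assuming a simplified stand-in struct.
SketchHit and its two functions are hypothetical names for illustration, not
Lucy's actual HitDoc:

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hit; not Lucy's actual HitDoc layout. */
typedef struct {
    int   doc_num;
    float score;   /* NAN means "no score was computed" */
} SketchHit;

/* Build a hit for a collection pass that never computed a score. */
static SketchHit
SketchHit_unscored(int doc_num)
{
    SketchHit hit = { doc_num, NAN };
    return hit;
}

/* NaN compares unequal to everything, itself included, so the check
 * has to go through isnan() rather than ==. */
static bool
SketchHit_has_score(const SketchHit *hit)
{
    assert(hit != NULL);
    return !isnan(hit->score);
}
```

The appeal of NaN over a sentinel like 0.0 or -1.0 is that it can't be
mistaken for a legitimate score.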

> >  * HitDoc provides the novel methods Set_Score and Get_Score.
> Why Set_Score?  Isn't it fixed once returned?

Actually, if anything, the "score" argument to HitDoc's constructor should be

IndexReader's Fetch_Doc() method doesn't convey a score, so its doc_reader
member can't supply it at construction time.  It has to be set later, using

Here's the code from Hits_Next() which calls Set_Score():

    Obj*
    Hits_next(Hits *self)
    {
        ScoreDoc *score_doc
            = (ScoreDoc*)VA_Fetch(self->top_docs->score_docs, self->offset);

        if (!score_doc) {
            /** Bail if there aren't any more *captured* hits. (There may
             * be more total hits.) */
            return NULL;
        }
        else {
            /* Lazily fetch HitDoc, set score. */
            Obj *doc = Searchable_Fetch_Doc(self->searchable, /* ... */);
            if (OBJ_IS_A(doc, HITDOC)) {
                /* Only a HitDoc can carry a score. */
                HitDoc *hit_doc = (HitDoc*)doc;
                HitDoc_Set_Score(hit_doc, score_doc->score);
            }

            return doc;
        }
    }

I had to add that OBJ_IS_A test when Searchable_Fetch_Doc() switched over to
returning an Obj rather than a HitDoc.
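
For anyone unfamiliar with the pattern, here is a rough sketch of a
check-before-downcast test that walks a chain of parent vtables.  Every name
in it (SketchVTable, Sketch_is_a, and so on) is hypothetical, and it only
assumes single inheritance with the base struct as the first member; it is
not the actual definition of OBJ_IS_A:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-class record; each class points at its parent. */
typedef struct SketchVTable {
    const struct SketchVTable *parent;
    const char *name;
} SketchVTable;

/* Every object starts with a pointer to its class record. */
typedef struct { const SketchVTable *vtable; } SketchObj;
typedef struct { SketchObj base; float score; } SketchHitDoc;

static const SketchVTable SKETCH_OBJ    = { NULL,        "Obj" };
static const SketchVTable SKETCH_DOC    = { &SKETCH_OBJ, "Doc" };
static const SketchVTable SKETCH_HITDOC = { &SKETCH_DOC, "HitDoc" };

/* True if obj's class is target or descends from it. */
static bool
Sketch_is_a(const SketchObj *obj, const SketchVTable *target)
{
    assert(obj != NULL);
    for (const SketchVTable *vt = obj->vtable; vt != NULL; vt = vt->parent) {
        if (vt == target) { return true; }
    }
    return false;
}

/* Set the score only when the downcast is known to be safe. */
static bool
Sketch_set_score_if_hit(SketchObj *obj, float score)
{
    if (!Sketch_is_a(obj, &SKETCH_HITDOC)) { return false; }
    ((SketchHitDoc*)obj)->score = score;
    return true;
}
```

A plain Doc passes through untouched, which is the behavior wanted when the
fetch can return something other than a HitDoc.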

> This means when Lucy core needs to get the fields out of the docs (eg
> to index their values) it also relies on host code ("get" from its
> hashtable) to get the value?

Actually, what it does at index-time is wrap each host string with a
ViewCharBuf, which then allows most of the core to interact with the value
using the host-agnostic API.  

ViewCharBuf is a CharBuf that doesn't own its string.  It's a bit of a
dangerous class, because the caller has to take responsibility for not changing
the string out from underneath the ViewCharBuf and causing a memory error.
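
To make the hazard concrete, here is a small sketch of a non-owning string
view.  SketchView and its functions are hypothetical names for illustration,
not Lucy's actual ViewCharBuf API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view; it borrows the bytes and never frees them. */
typedef struct {
    const char *ptr;   /* borrowed from the caller */
    size_t      size;  /* length in bytes; no trailing NUL required */
} SketchView;

/* Wrap an existing buffer without copying it. */
static SketchView
SketchView_wrap(const char *ptr, size_t size)
{
    assert(ptr != NULL);
    SketchView view = { ptr, size };
    return view;
}

/* Compare the viewed bytes against a NUL-terminated C string. */
static int
SketchView_equals_str(const SketchView *view, const char *str)
{
    size_t len = strlen(str);
    return view->size == len && memcmp(view->ptr, str, len) == 0;
}
```

Since nothing is copied, any change the caller makes to the underlying buffer
shows through the view immediately, and if the buffer is freed or reallocated
the view is left pointing at invalid memory.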

At some point in the future, we need to go over how Inverter and its
Invert_Doc() method work.

> So basically KS/Lucy simply subclasses the index-time document to make
> the HitDoc.  Lucene probably cannot take the same approach since we
> change the "type" of your fields (when FieldInfos are merged) which
> of course causes confusion.  

Well, you could still return a HitDocument which subclassed Document -- you
just couldn't eliminate the confusion.

> Unless we can improve that... eg I think a "write once" schema could maybe
> work.

Sounds good to me, of course.  :)  But I imagine there will be resistance.
It's ironic how dearly certain Java folks treasure the "flexibility" of this
API, given that Java's such a straitjacket.

FWIW, I think the only use case that's unachievable under global field
semantics is the "slowly evolving type change".  Everything else is doable
with varying degrees of elegance.

Marvin Humphrey
