incubator-lucy-dev mailing list archives

From Michael McCandless <>
Subject Re: index-time vs search-time Document
Date Sun, 12 Apr 2009 18:36:13 GMT
On Sun, Apr 12, 2009 at 8:40 AM, Marvin Humphrey <> wrote:

>> How does/will KS/Lucy handle documents at search time?  What is the planned API?
> I think the only aspect of Lucy documents that's been discussed is that they
> will be hash-based rather than array-based.  I don't recall having thrashed
> out the index-time vs. search-time issues.  I can tell you what KS trunk is
> doing, though.


>> We struggle with this in Lucene because we use the same class
>> (Document) to represent the document at index time and search time,
>> yet, many details of the document are not preserved properly, and it
>> caused plenty of problems.
> At search time, the default doc reader returns a HitDoc object, which
> subclasses Doc and differs in the following ways:
>  * The HitDoc constructor takes a float "score".

When you do field-sorted top N collection do you also store the field
values in each HitDoc?

>  * HitDoc provides the novel methods Set_Score and Get_Score.

Why Set_Score?  Isn't it fixed once returned?

>  * Set_Boost and Get_Boost throw errors on a HitDoc.

Ahhh... that's good.
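
A minimal sketch of the HitDoc behaviour described in the quoted bullets, written in Python rather than Lucy's C/Perl. The class and method names below are illustrative assumptions modelled on the names in the thread, not the actual KS/Lucy API.

```python
class Doc:
    """Index-time document: a hash of fields plus a doc-level boost."""

    def __init__(self, fields=None, boost=1.0):
        self.fields = dict(fields or {})
        self._boost = boost

    def get_boost(self):
        return self._boost

    def set_boost(self, boost):
        self._boost = boost


class HitDoc(Doc):
    """Search-time document: carries a score and rejects boost access."""

    def __init__(self, fields=None, score=0.0):
        super().__init__(fields)
        self._score = score

    def get_score(self):
        return self._score

    def set_score(self, score):
        self._score = score

    # Boost is an index-time concept, so both accessors fail loudly
    # instead of returning a value that was never preserved.
    def get_boost(self):
        raise TypeError("boost is not available on a search-time HitDoc")

    def set_boost(self, boost):
        raise TypeError("boost cannot be set on a search-time HitDoc")
```

Failing loudly on the boost accessors means a caller cannot mistake a default value for the boost that was supplied at index time.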

> In addition, I recently made an experimental change:  IxReader_Fetch_Doc(),
> Searchable_Fetch_Doc() and Hits_Next() all return an Obj rather than a HitDoc.
> This enables greater flexibility for alternative DocReader implementations: it
> allowed ByteBufDocReader to return a ByteBuf (which gets converted to a plain
> old Perl scalar at the C/Perl boundary).


>> Much of this comes down to the fixed-vs-malleable schema difference,
>> so KS/Lucy can do a better job preserving eg the FieldSpec used at
>> indexing time, but there must still be things (eg field & doc level
>> boost)?
> Field-level boost is set via a member variable in FieldSpec.  This is slightly
> less flexible than Lucene, but covers the most common use case of e.g.
> weighting "title" more heavily or "internal_id" less heavily, and avoids the
> index-time vs. search-time field boost retrieval problem.

Hmmm OK.

> Maybe we ought to have the indexer class's Add_Doc() take a second "boost"
> argument, which defaults to 1.0?  That models what happens internally more
> closely and it allows us to eliminate the boost member from Doc altogether.


> Here are the ways that Add_Doc() could be used with the Perl bindings if we make
> that call:
>   my %fields = ( title => 'foo', content => 'bar' );
>   $indexer->add_doc(\%fields);
>   $indexer->add_doc( doc => \%fields );
>   $indexer->add_doc( doc => \%fields, boost => 2.5 );
>   my $doc = Lucy::Doc->new( fields => { title => 'foo', content => 'bar' } );
>   $indexer->add_doc($doc);
>   $indexer->add_doc( doc => $doc );
>   $indexer->add_doc( doc => $doc, boost => 2.5 );
> Note: Doc's internal "fields" member is set up as a void* which the
> host will fill in with its own hash table implementation.  This complicates
> the internals, but makes for a more "hosty", user-friendly API, and also
> increases performance by minimizing memory copies when the user sets/gets
> field values.

I like that new word, hosty :)

This means when Lucy core needs to get the fields out of the docs (eg
to index their values) it also relies on host code ("get" from its
hashtable) to get the value?
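
The proposed Add_Doc() calling conventions can be sketched as follows. This is a Python analogue of the quoted Perl usage, under stated assumptions: `Indexer`, `Doc` and the `added` list are hypothetical stand-ins, and reading field values through the host's own hash is only one possible answer to the question above, not something the thread confirms.

```python
class Doc:
    """Thin wrapper around a host-native hash; carries no boost member."""

    def __init__(self, fields):
        # Keep a reference to the caller's dict rather than copying it.
        self.fields = fields


class Indexer:
    def __init__(self):
        # (fields, boost) pairs, standing in for the real indexing step.
        self.added = []

    def add_doc(self, doc, boost=1.0):
        """Accept either a plain dict or a Doc; boost defaults to 1.0."""
        fields = doc.fields if isinstance(doc, Doc) else doc
        if not isinstance(fields, dict):
            raise TypeError("doc must be a dict or a Doc")
        # Values are read through the host hash's own lookup.
        values = {name: fields[name] for name in fields}
        self.added.append((values, float(boost)))


indexer = Indexer()
fields = {"title": "foo", "content": "bar"}
indexer.add_doc(fields)                  # plain hash, default boost
indexer.add_doc(fields, boost=2.5)       # plain hash, explicit boost
indexer.add_doc(Doc(fields))             # Doc object, default boost
indexer.add_doc(Doc(fields), boost=2.5)  # Doc object, explicit boost
```

Because the boost travels as an argument to the call, the Doc itself never needs a boost member.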

So basically KS/Lucy simply subclasses the index-time document to make
the HitDoc.  Lucene probably cannot take the same approach since we
change the "type" of your fields (when FieldInfos are merged) which
of course causes confusion.  Unless we can improve that... eg I think
a "write once" schema could maybe work.
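
One way to read the "write once" idea, as a rough sketch: a field's type can be registered once and any later attempt to change it is rejected. `WriteOnceSchema`, `spec_field` and `fetch_type` are hypothetical names used only for illustration; this is neither Lucene's nor Lucy's API.

```python
class WriteOnceSchema:
    """Maps field names to types; a field's type can never change once set."""

    def __init__(self):
        self._types = {}

    def spec_field(self, name, field_type):
        existing = self._types.get(name)
        if existing is not None and existing != field_type:
            raise ValueError(
                f"field {name!r} is already defined as {existing!r}"
            )
        # First definition, or an identical re-definition (harmless).
        self._types[name] = field_type

    def fetch_type(self, name):
        return self._types[name]


schema = WriteOnceSchema()
schema.spec_field("title", "text")
schema.spec_field("title", "text")  # same type again: accepted
```

With such a rule, a field's type at search time is the one it was first given, so merging could not silently change it.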

