incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: index-time vs search-time Document
Date Sun, 12 Apr 2009 12:40:34 GMT
On Sun, Apr 12, 2009 at 05:24:58AM -0400, Michael McCandless wrote:
> How does/will KS/Lucy handle documents at search time?  What is the planned API?

I think the only aspect of Lucy documents that's been discussed is that they
will be hash-based rather than array-based.  I don't recall having thrashed
out the index-time vs. search-time issues.  I can tell you what KS trunk is
doing, though.

> We struggle with this in Lucene because we use the same class
> (Document) to represent the document at index time and search time,
> yet, many details of the document are not preserved properly, and it
> caused plenty of problems.

At search time, the default doc reader returns a HitDoc object, which
subclasses Doc and differs in the following ways:

  * The HitDoc constructor takes a float "score".
  * HitDoc provides the novel methods Set_Score and Get_Score.
  * Set_Boost and Get_Boost throw errors on a HitDoc.
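
To make the three differences concrete, here is a minimal C sketch. The struct
layouts, the HitDoc_init name and the bool return convention are assumptions
made for illustration only; KS trunk throws an error from the boost accessors,
which is modeled here by returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified layouts; not the actual KS/Lucy structs. */
typedef struct Doc {
    void  *fields;   /* host-supplied hash table, opaque to the core */
    float  boost;    /* index-time document boost */
} Doc;

typedef struct HitDoc {
    Doc   base;      /* "subclass" by embedding the parent as first member */
    float score;     /* search-time score, passed to the constructor */
} HitDoc;

/* Constructor: unlike Doc, a HitDoc requires a score. */
static HitDoc
HitDoc_init(void *fields, float score) {
    HitDoc self;
    self.base.fields = fields;
    self.base.boost  = 1.0f;
    self.score       = score;
    return self;
}

/* Methods that HitDoc adds on top of Doc. */
static void
HitDoc_Set_Score(HitDoc *self, float score) {
    assert(self != NULL);
    self->score = score;
}

static float
HitDoc_Get_Score(const HitDoc *self) {
    assert(self != NULL);
    return self->score;
}

/* Boost has no meaning at search time.  The real methods throw; this
 * sketch reports failure by returning false and leaves state untouched. */
static bool
HitDoc_Set_Boost(HitDoc *self, float boost) {
    (void)self;
    (void)boost;
    return false;
}

static bool
HitDoc_Get_Boost(const HitDoc *self, float *boost_out) {
    (void)self;
    (void)boost_out;
    return false;
}
```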

In addition, I recently made an experimental change:  IxReader_Fetch_Doc(),
Searchable_Fetch_Doc() and Hits_Next() all return an Obj rather than a HitDoc.
This enables greater flexibility for alternative DocReader implementations: it
allowed ByteBufDocReader to return a ByteBuf (which gets converted to a plain
old Perl scalar at the C/Perl boundary).
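
Roughly, the idea is that the fetch call hands back a generic base type and
the caller checks what it actually received.  The sketch below assumes a type
tag on Obj and a function pointer on DocReader; both are illustrative
stand-ins, not the real object model or dispatch mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tag; the real Obj hierarchy is not shown here. */
typedef enum { OBJ_HITDOC, OBJ_BYTEBUF } ObjType;

typedef struct Obj     { ObjType type; } Obj;
typedef struct HitDoc  { Obj base; float score; } HitDoc;
typedef struct ByteBuf { Obj base; const char *ptr; size_t size; } ByteBuf;

/* A DocReader decides what kind of Obj it produces. */
typedef struct DocReader DocReader;
struct DocReader {
    Obj *(*fetch_doc)(DocReader *self, int doc_id);
    Obj  *stored;    /* single canned result, enough for this sketch */
};

/* Toy implementation: always returns the one stored object. */
static Obj *
DocReader_return_stored(DocReader *self, int doc_id) {
    (void)doc_id;
    return self->stored;
}

/* Stand-in for IxReader_Fetch_Doc(): the return type is Obj, not HitDoc. */
static Obj *
Reader_Fetch_Doc(DocReader *reader, int doc_id) {
    assert(reader != NULL);
    return reader->fetch_doc(reader, doc_id);
}

/* Callers (or the host binding layer) inspect the type before using it. */
static const char *
Obj_describe(const Obj *obj) {
    switch (obj->type) {
        case OBJ_HITDOC:  return "HitDoc";
        case OBJ_BYTEBUF: return "ByteBuf";
    }
    return "unknown";
}
```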

> Much of this comes down to the fixed-vs-malleable schema difference,
> so KS/Lucy can do a better job preserving eg the FieldSpec used at
> indexing time, but there must still be things (eg field & doc level
> boost)?

Field-level boost is set via a member variable in FieldSpec.  This is slightly
less flexible than Lucene, but covers the most common use case of e.g.
weighting "title" more heavily or "internal_id" less heavily, and avoids the
index-time vs. search-time field boost retrieval problem.

Maybe we ought to have the indexer class's Add_Doc() take a second "boost"
argument, which defaults to 1.0?  That models what happens internally more
closely, and it allows us to eliminate the boost member from Doc altogether.

Here are the ways that Add_Doc() could be used with the Perl bindings if we make
that call:

   my %fields = ( title => 'foo', content => 'bar' );
   $indexer->add_doc( doc => \%fields );
   $indexer->add_doc( doc => \%fields, boost => 2.5 );

   my $doc = Lucy::Doc->new( fields => { title => 'foo', content => 'bar' } );
   $indexer->add_doc( doc => $doc );
   $indexer->add_doc( doc => $doc, boost => 2.5 );
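
On the C side, the proposal might look something like the sketch below.  C has
no default arguments, so the 1.0 default is supplied by a thin wrapper here;
in practice the binding layer would more likely fill it in.  The struct
members and the Indexer_Add_Doc_Default name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Under the proposal, Doc no longer carries a boost member. */
typedef struct Doc {
    void *fields;        /* host-supplied hash table */
} Doc;

/* Hypothetical indexer state, just enough to observe the calls. */
typedef struct Indexer {
    size_t doc_count;    /* number of docs added so far */
    float  last_boost;   /* boost applied to the most recent doc */
} Indexer;

/* Boost travels with the call instead of living on the document. */
static void
Indexer_Add_Doc(Indexer *self, const Doc *doc, float boost) {
    assert(self != NULL && doc != NULL);
    self->last_boost = boost;
    self->doc_count++;
}

/* Wrapper that supplies the default boost of 1.0. */
static void
Indexer_Add_Doc_Default(Indexer *self, const Doc *doc) {
    Indexer_Add_Doc(self, doc, 1.0f);
}
```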

Note: Doc's internal "fields" member is set up as a void* which the
host will fill in with its own hash table implementation.  This complicates
the internals, but makes for a more "hosty", user-friendly API, and also
increases performance by minimizing memory copies when the user sets/gets
field values.
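
As an illustration of the void* arrangement, here is a sketch in which a toy
fixed-size table stands in for the host's hash.  ToyHostHash and its functions
are invented for the example; a real binding would pass its native hash (a
Perl HV*, say) through the same opaque pointer.  Values are stored by pointer,
so setting or getting a field copies nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The core only ever sees an opaque pointer. */
typedef struct Doc {
    void *fields;
} Doc;

/* Toy stand-in for a host hash table (hypothetical). */
#define TOY_MAX_FIELDS 8
typedef struct ToyHostHash {
    const char *keys[TOY_MAX_FIELDS];
    const char *vals[TOY_MAX_FIELDS];
    size_t      count;
} ToyHostHash;

/* Host-side glue: store a value by pointer.  Returns 1 on success,
 * 0 if the toy table is full. */
static int
ToyHost_store(void *fields, const char *key, const char *val) {
    ToyHostHash *hash = (ToyHostHash *)fields;
    assert(hash != NULL);
    for (size_t i = 0; i < hash->count; i++) {
        if (strcmp(hash->keys[i], key) == 0) {
            hash->vals[i] = val;      /* overwrite existing field */
            return 1;
        }
    }
    if (hash->count == TOY_MAX_FIELDS) { return 0; }
    hash->keys[hash->count] = key;
    hash->vals[hash->count] = val;
    hash->count++;
    return 1;
}

/* Host-side glue: fetch the stored pointer, or NULL if absent. */
static const char *
ToyHost_fetch(void *fields, const char *key) {
    ToyHostHash *hash = (ToyHostHash *)fields;
    assert(hash != NULL);
    for (size_t i = 0; i < hash->count; i++) {
        if (strcmp(hash->keys[i], key) == 0) { return hash->vals[i]; }
    }
    return NULL;
}
```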

Marvin Humphrey
