incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Fwd: [lucy-user] num_wanted = $infinity?
Date Fri, 23 Mar 2012 20:33:40 GMT
On Fri, Mar 23, 2012 at 11:48 AM, Logan Bell <loganbell@gmail.com> wrote:
> Would anyone be opposed if I fleshed out the documentation around the
> following links to explain a couple patterns that this e-mail chain reminded
> me of when I first started Lucy?

You've identified a common question, all right, and I think addressing it in
our official documentation would be a nice improvement. :)

> The documents in question are:
> http://incubator.apache.org/lucy/docs/perl/Lucy/Search/IndexSearcher.html
> http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/Tutorial/BeyondSimple.html

It might be a little tricky to integrate this into IndexSearcher's reference
docs, so I would advocate either integrating it into the Tutorial, or perhaps
better yet, writing a short Cookbook entry and linking to it from the
Tutorial.  Not every Cookbook entry has to be as long as CustomQuery or
CustomQueryParser!

> It's not clear how to obtain all documents associated with a query and that
> the num_wanted value defaulted to 10. I would like to give an example of
> how one might get all results and also update the IndexSearcher
> documentation to mention that num_wanted is defaulted to 10 (with an offset
> of 0).

The reason we haven't documented this idiom before is that we don't really
want to encourage people to use it -- users should be shunted towards a best
practice of paging through hits.

The memory consumed during search when you say "give me *all* matches" scales
with index size, and can get out of control with large indexes.

Nevertheless, it's such a common question that we ought to make it easy
to find the answer.
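
A minimal sketch of the paging approach, for comparison.  It assumes only
what the IndexSearcher docs describe: hits() accepts "query", "offset" and
"num_wanted", and the returned Hits object yields one hit per call to next().
The helper name for_each_hit and its "page_size" and "callback" parameters
are hypothetical, made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: walk every hit for a query one page at a time,
# rather than requesting all matches in a single call.
sub for_each_hit {
    my %args      = @_;
    my $searcher  = $args{searcher};
    my $page_size = $args{page_size} || 10;    # same size as the num_wanted default
    my $offset    = 0;

    while (1) {
        my $hits = $searcher->hits(
            query      => $args{query},
            offset     => $offset,
            num_wanted => $page_size,
        );

        # Hand each hit on this page to the caller and count them.
        my $seen = 0;
        while ( defined( my $hit = $hits->next ) ) {
            $args{callback}->($hit);
            $seen++;
        }

        # A short (or empty) page means there is nothing left to fetch.
        last if $seen < $page_size;
        $offset += $page_size;
    }
    return;
}

# Usage (assumes an existing index; the path is a placeholder):
#   my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
#   for_each_hit(
#       searcher  => $searcher,
#       query     => 'foo',
#       page_size => 50,
#       callback  => sub { my $hit = shift; print "$hit->{title}\n" },
#   );
```

The "title" field in the usage comment is also an assumption; substitute
whatever stored fields the schema actually defines.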

> my $doc_count = $searcher->doc_max;
> my $hits = $searcher->hits(    # returns a Hits object, not a hit count
>    query      => 'foo',
>    num_wanted => $doc_count,
> );

IMO, this code sample would be improved by using "$doc_max" as the variable
name.  As a matter of coding style, I think it's desirable to associate the
name of the variable with the name of the method where the value came from.
But more importantly, "Doc_Count" is actually an IndexReader method which does
something slightly different from "Doc_Max":

    /** Return the maximum number of documents available to the reader, which
     * is also the highest possible internal document id.  Documents which
     * have been marked as deleted but not yet purged from the index are
     * included in this count.
     */
    public abstract int32_t
    Doc_Max(IndexReader *self);

    /** Return the number of documents available to the reader, subtracting
     * any that are marked as deleted.
     */
    public abstract int32_t
    Doc_Count(IndexReader *self);

Doc_Max() is what you want whenever you're allocating space to hold document
numbers, like we are here.
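
With that naming change applied, the "all matches" idiom might look like the
sketch below.  The wrapper name all_hits is hypothetical; doc_max(), hits()
and the "query", "offset" and "num_wanted" parameters are the ones already
discussed in this thread.  The memory caveat above still applies: this asks
for room for every document in the index.

```perl
use strict;
use warnings;

# Hypothetical helper: request every match for a query in one call by
# sizing num_wanted with the searcher's doc_max().
sub all_hits {
    my ( $searcher, $query ) = @_;

    # doc_max() includes documents marked as deleted but not yet purged,
    # so it can never be smaller than the number of possible matches.
    my $doc_max = $searcher->doc_max;

    return $searcher->hits(
        query      => $query,
        offset     => 0,           # the default, spelled out for clarity
        num_wanted => $doc_max,
    );
}

# Usage (assumes an existing index; the path is a placeholder):
#   my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
#   my $hits     = all_hits( $searcher, 'foo' );
#   while ( my $hit = $hits->next ) { ... }
```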

Marvin Humphrey
