lucy-dev mailing list archives

From Logan Bell <loganb...@gmail.com>
Subject Re: [lucy-dev] Fwd: [lucy-user] num_wanted = $infinity?
Date Fri, 23 Mar 2012 21:52:22 GMT
On Fri, Mar 23, 2012 at 1:33 PM, Marvin Humphrey <marvin@rectangular.com> wrote:

> On Fri, Mar 23, 2012 at 11:48 AM, Logan Bell <loganbell@gmail.com> wrote:
> > Would anyone be opposed if I fleshed out the documentation around the
> > following links to explain a couple of patterns that this e-mail chain
> > reminded me of from when I first started with Lucy?
>
> You've identified a common question, all right, and I think addressing it
> in our official documentation would be a nice improvement. :)
>
> > The documents in question are:
> >
> > http://incubator.apache.org/lucy/docs/perl/Lucy/Search/IndexSearcher.html
> >
> > http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/Tutorial/BeyondSimple.html
>
> It might be a little tricky to integrate this into IndexSearcher's
> reference docs, so I would advocate either integrating it into the
> Tutorial, or perhaps better yet, writing a short Cookbook entry and
> linking to it from the Tutorial.  Not every Cookbook entry has to be as
> long as CustomQuery or CustomQueryParser!
>

+1

>
> > It's not clear how to obtain all documents associated with a query, or
> > that the num_wanted value defaults to 10. I would like to give an example
> > of how one might get all results and also update the IndexSearcher
> > documentation to mention that num_wanted defaults to 10 (with an offset
> > of 0).
>
> The reason we haven't documented this idiom before is because we don't
> really want to encourage people to use it -- users should be shunted
> towards a best practice of paging through hits.
>
> The memory consumed during search when you say "give me *all* matches"
> scales with index size, and can get out of control with large indexes.
>
> Nevertheless, it's such a common question that we ought to make it easy
> to find the answer.
>

Agreed - perhaps with a stern caveat/warning that this is not advocated for
large indexes. Surely paging is what we ultimately want.
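
For the paging side of that caveat, something along these lines might serve
as a starting point. It is only a sketch: fetch_page and its arguments are
hypothetical names, and $searcher is assumed to be a
Lucy::Search::IndexSearcher (or anything with a compatible hits method),
relying on the offset and num_wanted parameters discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch a single page of hits instead of all matches.
# $searcher is assumed to behave like a Lucy::Search::IndexSearcher.
sub fetch_page {
    my ( $searcher, $query, $page, $per_page ) = @_;

    my $hits = $searcher->hits(
        query      => $query,
        offset     => ( $page - 1 ) * $per_page,    # skip earlier pages
        num_wanted => $per_page,                    # bound the work per request
    );

    # Collect the hits for this page only.
    my @docs;
    while ( my $hit = $hits->next ) {
        push @docs, $hit;
    }

    # total_hits reports how many documents matched overall, which is
    # enough to work out how many pages there are.
    return ( \@docs, $hits->total_hits );
}

# Possible usage (the index path is a placeholder):
#   my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
#   my ( $docs, $total ) = fetch_page( $searcher, 'foo', 2, 10 );
```

Each call asks for only $per_page hits, so the size of the request does not
grow with the index the way num_wanted => $doc_max does.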


>
> > my $doc_count = $searcher->doc_max;
> > my $hits = $searcher->hits(    # returns a Hits object, not a hit count
> >    query      => 'foo',
> >    num_wanted => $doc_count,
> > );
>
> IMO, this code sample would be improved by using "$doc_max" as the variable
> name.  As a matter of coding style, I think it's desirable to associate the
> name of the variable with the name of the method where the value came from.
> But more importantly, "Doc_Count" is actually an IndexReader method which
> does something slightly different from "Doc_Max":
>
>    /** Return the maximum number of documents available to the reader,
>     * which is also the highest possible internal document id.  Documents
>     * which have been marked as deleted but not yet purged from the index
>     * are included in this count.
>     */
>    public abstract int32_t
>    Doc_Max(IndexReader *self);
>
>    /** Return the number of documents available to the reader, subtracting
>     * any that are marked as deleted.
>     */
>    public abstract int32_t
>    Doc_Count(IndexReader *self);
>
> Doc_Max() is what you want whenever you're allocating space to hold
> document numbers, like we are here.
>

Sure, that's probably a better variable name. However, this raises another
question, for me and potentially for the documentation: is it possible to
obtain all documents excluding the ones marked for deletion?
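
To make the distinction between the two counts concrete, here is a minimal
sketch. The helper name count_summary is hypothetical, and $reader is
assumed to be an object exposing doc_max and doc_count, such as one obtained
from Lucy::Index::IndexReader->open.

```perl
use strict;
use warnings;

# Hypothetical helper: report both counts side by side.
# $reader is assumed to behave like a Lucy::Index::IndexReader.
sub count_summary {
    my ($reader) = @_;

    my $doc_max   = $reader->doc_max;      # includes deleted-but-unpurged docs
    my $doc_count = $reader->doc_count;    # excludes docs marked as deleted

    return {
        doc_max   => $doc_max,
        doc_count => $doc_count,
        deleted   => $doc_max - $doc_count,    # marked as deleted, not yet purged
    };
}

# Possible usage (the index path is a placeholder):
#   my $reader  = Lucy::Index::IndexReader->open( index => '/path/to/index' );
#   my $summary = count_summary($reader);
#   print "live: $summary->{doc_count}, max: $summary->{doc_max}\n";
```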

Thanks!
Logan

>
> Marvin Humphrey
>
