lucene-java-user mailing list archives

From Antony Sequeira <antony.seque...@gmail.com>
Subject Re: performance question - number of documents
Date Mon, 24 Oct 2011 05:33:48 GMT
This may not be directly relevant to Lucene, but I wanted to learn:

How does a web search engine handle something like this?
Does it also "score every matching document on every query", OR
does it pick a subset first based on some static/offline ranking criteria
and then do what Lucene does, OR
does it find every matching document, pick a subset of the
results based on a static ranking, and then score that subset based on the
query terms?
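
To make the second and third options concrete, here is a minimal sketch of a two-phase search: cut the matches down by a precomputed static rank, then run the query-dependent scorer only on the survivors. This is not Lucene code and not a description of how any real web search engine works; `TwoPhaseSketch`, `Doc`, `staticRank`, and `queryScorer` are all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical illustration only: none of these names come from Lucene.
class TwoPhaseSketch {

    // A matching document with a static rank computed offline.
    static final class Doc {
        final int id;
        final double staticRank;

        Doc(int id, double staticRank) {
            this.id = id;
            this.staticRank = staticRank;
        }
    }

    // Phase 1: keep the `candidates` matches with the highest static rank.
    // Phase 2: run the (assumed expensive) query scorer on those only,
    //          and return the ids of the best `k`, best first.
    static List<Integer> search(List<Doc> matches, int candidates, int k,
                                ToDoubleFunction<Doc> queryScorer) {
        List<Doc> byStatic = new ArrayList<>(matches);
        byStatic.sort(Comparator.comparingDouble((Doc d) -> d.staticRank).reversed());
        List<Doc> subset = byStatic.subList(0, Math.min(candidates, byStatic.size()));

        // Each surviving candidate is scored exactly once: {id, score}.
        List<double[]> scored = new ArrayList<>();
        for (Doc d : subset) {
            scored.add(new double[] {d.id, queryScorer.applyAsDouble(d)});
        }
        scored.sort((a, b) -> Double.compare(b[1], a[1]));

        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            ids.add((int) scored.get(i)[0]);
        }
        return ids;
    }
}
```

The trade-off is visible in the sketch: the scorer runs at most `candidates` times regardless of how many documents match, but a document with a low static rank is dropped in phase 1 even if it would have scored highest for the query.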

I guess the assumption I am making is that it's not practical to "score
every matching document on every query" at www scale.
Maybe that assumption is wrong.

I also haven't understood how search scales :(

-Antony

On Sun, Oct 23, 2011 at 10:18 AM, Erick Erickson <erickerickson@gmail.com> wrote:

> "Why would it matter...top 5 matches" Because Lucene has to calculate
> the score of all matching documents in order to ensure that the 5 it
> returns really are the 5 best.
> What if the very last document scored was the most relevant?
>
> Best
> Erick
>
> On Sun, Oct 23, 2011 at 3:06 PM, sol myr <solmyr72@yahoo.com> wrote:
> > Hi,
> >
> > We've noticed a Lucene performance phenomenon, and would appreciate an
> > explanation from anyone familiar with Lucene internals
> > (I know Lucene as a user, but haven't looked under its hood).
> >
> > We have a Lucene index of about 30 million records.
> > We ran 2 queries: "AND" and "OR" ("+john +doe" versus "john doe").
> > The AND query had much better performance (AND takes about 500 millis,
> > while OR takes about 2000 millis).
> >
> > We wondered whether this has anything to do with the number of potential
> > matches?
> > Our AND has only about 5000 matches (5000 documents contain *both* "john"
> > and "doe").
> > Our OR has about 8 million matches (8 million documents contain *either*
> > "john" or "doe").
> >
> >
> > Does this explain the performance difference?
> > But why would it matter, as long as we take only the top 5 matches
> > (indexSearcher.search(query, 5))...?
> > Is there any other explanation?
> >
> > Thanks :)
> >
>
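
Erick's point in the quoted reply can be sketched as follows: asking for only the top 5 bounds the memory used for results, but not the number of documents that have to be scored. The code is a simplified illustration of the general bounded-heap technique, not Lucene's implementation; `TopKSketch` and `topK` are hypothetical names, and the scores are assumed to be already available as a plain array.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: not Lucene's collector code.
class TopKSketch {

    // Returns the k highest scores, best first.
    // The loop visits every matching document's score once; the heap
    // never holds more than k entries.
    static List<Double> topK(double[] scores, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the weakest of the currently kept hits sits on top.
        PriorityQueue<Double> heap = new PriorityQueue<>();
        for (double s : scores) {
            if (heap.size() < k) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll(); // evict the weakest kept hit
                heap.add(s);
            }
        }
        List<Double> best = new ArrayList<>(heap);
        best.sort(Collections.reverseOrder());
        return best;
    }
}
```

With the numbers from the original question, the loop body would run about 5000 times for the AND query and about 8 million times for the OR query, even though both calls return 5 hits. Stopping early is not safe in this sketch because the last score visited may be the highest.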
