lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yang <teddyyyy...@gmail.com>
Subject Re: lucene algorithm ?
Date Fri, 27 Apr 2012 19:45:09 GMT
Thanks Ralf.

basically you are talking about selectivity of  columns in a  JOIN, right?


but in my above example, "yellow dog", both terms are very common, and both
have long postings lists.

Yang
On Thu, Apr 26, 2012 at 12:17 AM, Ralf Heyde <ralf.heyde@gmx.de> wrote:

> Hi,
>
> i dont know the correct implementation. All that I can say is, that you
> should take a look at query optimization in state-of-the-art database
> systems. The generell solution is to select this part of a query first,
> which reduces the resultset most.
>
> For example you try to search for a term like "I'm looking for a term" -
> each token has a selectivity (TF/IDF, histograms (ie. Zipf-Distribution)).
>
> Term / Frequency
> I'm - often
> looking - normal
> for - stopword (too often and maybe ignored)
> a - stopword (too often and maybe ignored)
> term - rare
>
> So - Lucene might to select all documents, which contain "term", then on
> the "small" resultset looking for documents matching "looking" ... and so
> on.
> For that reason Lucene stores information like Histograms and other
> "things" ...
>
> I hope I gave you a simple example, which Lucene might use.
>
> Greets from Berlin,
>
> Ralf
>
>
> -------- Original-Nachricht --------
> > Datum: Wed, 25 Apr 2012 14:20:10 -0700
> > Von: Yang <teddyyyy123@gmail.com>
> > An: java-user@lucene.apache.org
> > Betreff: Re: lucene algorithm ?
>
> > additionally,  anybody knows roughly (of course the details are a secret,
> > but I guess the main ideas should be
> > common enough these days) how google does fast ranking in cases of
> > multi-term queries with AND ?
> > (if their postings are sorted by PageRank order, then it's understandable
> > that a single term query would quickly return the top-k, but if it's
> > multi-term, they would have to traverse the entire lists to find the
> > insersection set, because the lists are not sorted by docId, as in the
> > Lucene paper case)
> >
> >
> >
> > On Wed, Apr 25, 2012 at 2:13 PM, Yang <teddyyyy123@gmail.com> wrote:
> >
> > > I read the paper by Doug "Space optimizations for total ranking",
> > >
> > > since it was written a long time ago, I wonder what algorithms lucene
> > uses
> > > (regarding postings list traversal and score calculation, ranking)
> > >
> > >
> > > particularly the total ranking algorithm described there needs to
> > traverse
> > > down the entire postings list for all the query terms,
> > > so in case of very common query terms like "yellow dog", either of the
> 2
> > > terms may have a very very long postings list in case of web search,
> > > are they all really traversed in current lucene/Solr ? or  any
> > heuristics
> > > to truncate the list are actually employed?
> > >
> > > in the case of returning top-k results, I can understand that
> > partitioning
> > > the postings list into multiple machines, and then combining the  top-k
> > > from each would work,
> > > but if we are required to return "the 100th result page", i.e. results
> > > ranked from 990--1000th, then each partition would still have to find
> > out
> > > the top 1000, so
> > > partitioning would not help much.
> > >
> > >
> > > overall, is there any up-to-date detailed docs on the internal
> > algorithms
> > > of lucene?
> > >
> > > Thanks a lot
> > > Yang
> > >
>
> --
> Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir
> belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message