lucene-java-user mailing list archives

From Ashish Jaen <ashishj...@gmail.com>
Subject Re: Boolean Query: Knowing Which Clauses Matched
Date Wed, 18 Jul 2012 14:22:34 GMT
It would be great if someone could show how to do it.
For my application, I do not care about scores (a plain boolean
search is sufficient).

In the meantime, I experimented with a workaround and would like to
share the findings:

Problem details:
On a collection of 10 million documents, I want to run boolean queries.
These boolean queries act as document classifiers for us, and there are
about 1500 such queries (each having about 300 boolean clauses). If a
document matches a query, we want to know which parts of the boolean
query match the doc (this is a BI application which does text analytics,
and we need the counts for each matched boolean clause for statistics
purposes).

As a workaround, I create a filter from the original boolean query, cache
it, and then fire each boolean sub-query against that filter. This has
given me a large performance gain (these are initial observations; I am
still evaluating the performance).


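Some pseudo-code (Lucene 3.x API):
// Cache the doc set matched by the full query
Filter filter = new QueryWrapperFilter(bigBooleanQuery);
CachingWrapperFilter cachingFilter;
cachingFilter = new CachingWrapperFilter(filter);

// Fire each boolean sub-query against the cached filter,
// e.g. counting its hits with a TotalHitCountCollector
for (BooleanClause clause : bigBooleanQuery.clauses()) {
    TotalHitCountCollector counter = new TotalHitCountCollector();
    searcher.search(clause.getQuery(), cachingFilter, counter);
}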
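
To make the idea concrete without depending on Lucene, here is a minimal sketch in plain Java. It assumes each query's matches can be modelled as a java.util.BitSet of doc ids; the BitSet only stands in for the cached filter and is not Lucene API. The names ClauseMatchCounts, countPerClause and docs are made up for this illustration.

```java
import java.util.BitSet;

// Toy model: each BitSet holds the doc ids matched by one query.
// BitSet stands in for the cached filter; none of this is Lucene API.
class ClauseMatchCounts {

    // For each sub-clause, count the docs it shares with the full query's doc set.
    static int[] countPerClause(BitSet fullQueryDocs, BitSet[] clauseDocs) {
        int[] counts = new int[clauseDocs.length];
        for (int i = 0; i < clauseDocs.length; i++) {
            BitSet hits = (BitSet) clauseDocs[i].clone(); // leave the input untouched
            hits.and(fullQueryDocs);                      // restrict to the cached set
            counts[i] = hits.cardinality();
        }
        return counts;
    }

    // Hypothetical helper: build a doc-id set from a list of ids.
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet full = docs(1, 3, 5, 8);
        BitSet[] clauses = { docs(1, 2, 3), docs(5, 9), docs(4) };
        int[] counts = countPerClause(full, clauses);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("clause " + i + ": " + counts[i]);
        }
    }
}
```

The full query's doc set is computed once and reused for every sub-clause, which is the same reason the cached filter helps: each sub-query only has to be intersected with an already known set of docs.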


On Wed, Jul 18, 2012 at 9:25 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> This is possible, using the ScorerVisitor (3.6) / getChildren (4.0).
> You need a custom collector that when it collects a competitive hit,
> visits the sub-scorers of your BooleanQuery and saves away which ones
> matched the current doc.
>
> But this is very expert and there are real challenges (eg not all
> scorers score document-at-a-time) ... would be nice if someone wrote
> up some example code showing how to do it...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jul 18, 2012 at 7:17 AM, Ashish Jaen <ashishjaen@gmail.com> wrote:
> > Is there a way to know which sub-clause of a boolean query matched in the
> > result document? Currently I am using searcher.explain() on each
> > sub-clause of the boolean query (on each of the documents returned by
> > searcher). However, this is turning out to be very slow as I need to
> > process ALL the documents returned by the query (A typical query returns
> > about 20 thousand documents and my collection has 10 million docs. My
> > application is not a user facing one, so few seconds per query is still
> > acceptable)
> >
> > I was wondering if there is an efficient way to achieve the above which
> > does not use explain() (perhaps storing the information about which
> > sub-clause matched a document while searching). Can anyone provide some
> > method to solve this and point to the relevant classes which need to be
> > changed.
> >
> > Thanks,
> > -Ashish
>
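
The collector approach Mike describes above can be modelled in plain Java as a rough sketch. It assumes every sub-clause can be walked document-at-a-time, which, as he points out, is not true of all scorers. Sorted int[] arrays of doc ids stand in for the sub-scorers; MatchedClauseRecorder and record are made-up names, and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a collector that records which sub-clauses matched each hit.
// Sorted int[] postings stand in for sub-scorers; none of this is Lucene API.
class MatchedClauseRecorder {

    // Walks all clauses document-at-a-time (OR semantics) and returns
    // doc id -> indices of the clauses that matched that doc.
    static Map<Integer, List<Integer>> record(int[][] postings) {
        int[] pos = new int[postings.length]; // cursor into each clause's postings
        Map<Integer, List<Integer>> matched = new TreeMap<>();
        while (true) {
            // The current doc is the smallest doc id under any cursor.
            int doc = Integer.MAX_VALUE;
            for (int i = 0; i < postings.length; i++) {
                if (pos[i] < postings[i].length) {
                    doc = Math.min(doc, postings[i][pos[i]]);
                }
            }
            if (doc == Integer.MAX_VALUE) {
                return matched; // every clause is exhausted
            }
            List<Integer> clauses = new ArrayList<>();
            for (int i = 0; i < postings.length; i++) {
                if (pos[i] < postings[i].length && postings[i][pos[i]] == doc) {
                    clauses.add(i); // this clause matched the current doc
                    pos[i]++;       // advance it past the current doc
                }
            }
            matched.put(doc, clauses);
        }
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 4, 7}, {4, 9}, {1, 7, 9} };
        for (Map.Entry<Integer, List<Integer>> e : record(postings).entrySet()) {
            System.out.println("doc " + e.getKey() + " matched clauses " + e.getValue());
        }
    }
}
```

The point of the model is that the per-clause information is saved while the hit is being collected, so no second pass such as explain() is needed for each document.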
