lucene-solr-user mailing list archives

From Costi Muraru <costimur...@gmail.com>
Subject Re: Evaluate function only on subset of documents
Date Tue, 24 Jun 2014 12:25:26 GMT
Thanks, guys, for your answers.
Sorry for the syntax errors I introduced in my previous queries.

Chris, you've been really helpful. Indeed, point 3 is the one I'm trying to
solve, rather than point 2.
You're saying that "BooleanScorer will consult the clauses in order based
on which clause says it can "skip" the most documents".
I think this might be the culprit for me.

Let's take this sample query:
XXX OR AAA AND {!frange ...}

For my use case:
AAA returns a subset of 100k documents.
frange returns 5k documents, all part of these 100k documents.

Therefore, frange skips the most documents. If I understand you correctly,
frange is then going to be evaluated against all documents (since it skips
the most), and AAA only against the resulting subset. This is roughly what
I originally observed. My goal is the reverse order, since frange is much
more expensive than AAA.
I was hoping to achieve this by specifying the cost, along the lines of
"frange has cost 100 while AAA has cost 1", so that AAA runs first and
frange runs only on the subset. However, the cost does not seem to be taken
into consideration.
Does this make sense, or am I getting something wrong? Is there something I
can do to achieve this?
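
A toy sketch of the two evaluation orders described above, using plain Python
in place of an index. It does not model Lucene's scorers. The corpus size of
1,000,000 documents and the two predicates are assumptions chosen only so that
AAA matches 100k documents and the function matches 5k of those.

```python
MAX_DOC = 1_000_000  # assumed corpus size, not stated in the thread


def matches_aaa(doc):
    """Cheap clause: matches 100k of the 1M doc ids (every 10th)."""
    return doc % 10 == 0


def make_counting_function():
    """Stand-in for the expensive function; counts how often it is called."""
    calls = {"n": 0}

    def expensive(doc):
        calls["n"] += 1
        return doc % 200 == 0  # 5k matches, all of them also AAA matches

    return expensive, calls


def function_first(max_doc):
    """Evaluate the expensive function on every doc, then check AAA."""
    expensive, calls = make_counting_function()
    hits = [d for d in range(max_doc) if expensive(d) and matches_aaa(d)]
    return len(hits), calls["n"]


def cheap_first(max_doc):
    """Check AAA first and evaluate the function only on AAA's matches."""
    expensive, calls = make_counting_function()
    hits = [d for d in range(max_doc) if matches_aaa(d) and expensive(d)]
    return len(hits), calls["n"]


if __name__ == "__main__":
    print("function first:", function_first(MAX_DOC))  # (hits, function calls)
    print("cheap first:   ", cheap_first(MAX_DOC))
```

Both orders return the same 5,000 hits, but the second calls the function
100,000 times instead of 1,000,000, which is the order being asked for here.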

Thanks,
Costi


On Tue, Jun 24, 2014 at 4:23 AM, Chris Hostetter <hossman_lucene@fucit.org>
wrote:

> : Now, if I want to make a query that also contains some OR, it is impossible
> : to do so with this approach. This is because fq with OR operator is not
> : supported (SOLR-1223). As an alternative I've tried these queries:
> :
> : county='New York' AND (location:Maylands OR location:Holliscort or
> : parking:yes) AND_val_:"{!frange u=0 cost=150 cache=false}mycustomfunction()"
>
> 1) most of the examples you've posted have syntax errors in them that are
> probably throwing a wrench into your testing.  in this example county='New
> York' is not valid syntax, presumably you want county:"New York"
>
> 2) based on the example you give, what you're trying to do here doesn't
> really depend on using "SHOULD" (ie: OR) type logic against the frange:
> the only disjunction you have is in a sub-query of a top level
> conjunction (ie: all required) ... the frange itself is still mandatory.
>
> so you could still use it as a non-cached postfilter just like in your
> previous example:
>
> q=+XXX +(YYY ZZZ)&fq={!frange cost=150 cache=false ...}
>
>
> 3) if that query wasn't exactly what you meant, and your top level query is
> more complex, containing a mix of MUST, MUST_NOT, and SHOULD clauses, ie:
>
> q=+XXX YYY ZZZ -AAA +{!frange ...}
>
> ...then the internal behavior of BooleanQuery will automatically do what
> you want (no need for cache or cost params on the fq) to the best
> of its ability because of how the evaluation of boolean clauses is
> "re-ordered" internally based on the "next" match.
>
> it's kind of complicated to explain, but the short version is:
>
> a) BooleanScorer will avoid asking any clause if it matches a document
> which has already been disqualified by another clause
> b) BooleanScorer will consult the clauses in order based on which clause
> says it can "skip" the most documents
>
> So you might see your custom function evaluated for some docs that
> ultimately don't match, but if there are more "rare" mandatory clauses
> of your BQ that tell Lucene it can skip over a large number of docs,
> then your custom function will be skipped.
>
> This is how BooleanQuery has always worked, but i just committed a test to
> verify it even when wrapping a FunctionRangeQuery...
>
> https://svn.apache.org/r1604990
>
>
> 4) the extreme of #3 is that if you need to use the {!frange} as part of
> a full disjunction, ie:
>
>    q=XXX OR YYY OR {!frange ...}
>
> ...then it would be impossible for Solr to only execute the expensive
> function against the subset of documents that match the query -- because
> BooleanScorer won't be able to tell which documents match the query unless
> it evaluates the function (it's a catch-22).   even if every doc does not
> match either XXX or YYY, solr has to evaluate the function against every
> doc to see if that function *makes* the document match the entire query.
>
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
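
For reference, a minimal sketch of sending the post-filter form from point 2 of
the quoted reply as an HTTP request. The host, port and collection name in
SOLR_SELECT_URL are hypothetical, and the frange body is the one from the
earlier query in this thread. The sketch only builds the URL; it mainly shows
that the "+" operators in q must be percent-encoded, since a raw "+" in a URL
query string is decoded as a space.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: adjust host, port and collection to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_select_url(q, fq):
    """Return a select URL with q and fq safely percent-encoded."""
    params = [("q", q), ("fq", fq)]
    # urlencode turns "+" into %2B and spaces into "+".
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Main query uses only the cheap clauses; the expensive function goes into
# a non-cached fq with cost >= 100 so that it can run as a post filter.
url = build_select_url(
    q="+XXX +(YYY ZZZ)",
    fq="{!frange u=0 cost=150 cache=false}mycustomfunction()",
)

if __name__ == "__main__":
    print(url)
    # Round-trip check: decoding the URL yields the original parameters.
    print(parse_qs(urlsplit(url).query))
```

Keeping the function out of q means the main query and any cheaper filters
narrow the candidate set before the function is evaluated.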
