lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikhail Khludnev <mkhlud...@griddynamics.com>
Subject Re: problems with DisjunctionMaxQuery and early-termination
Date Fri, 16 Mar 2012 18:29:22 GMT
On Fri, Mar 16, 2012 at 8:38 PM, Carlos Gonzalez-Cadenas <
cgc@experienceon.com> wrote:

> On Fri, Mar 16, 2012 at 9:26 AM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Hello Carlos,
>>
>
>> so, search all terms with MUST first, you've got the best result in terms
>> of precision and recall if you've got something. Otherwise you still have a
>> lot of time. You need to drop one of the words or switch ones of them into
>> SHOULD.
>
>
> Agree, this is precisely what we're trying to do (the idea of having
> multiple queries, from narrow to broad). My question was more of a
> practical nature, that is, how can we do these queries without really
> having to do independent SOLR queries.
>

Sorry I forget this question. I did a tiny DelegateRequestHandler
which sequentially iterates through list of other request handlers until
the first not empty results has been found. Every of slave RequestHandlers
has own QParser params e.g. I search for conjunction, then apply spelling
correction, and after that I search for disjunction. You can see from my
"theory" that the total time is equal to the time of the last successful
search.


> Now we use DisjunctionMaxQuery, but it has the problems that I described
> in my former email w.r.t. early-termination.
>
> This morning we found two potential directions that might work (we're
> testing them as of now):
>
>    1. Implement a custom RequestHandler and execute several queries
>    within SOLR (https://issues.apache.org/jira/browse/SOLR-1093). This is
>    better than executing them from outside and having all the network / HTTP /
>    ... overhead, but still not very good.
>    2. Modify DisjunctionMaxQuery. In particular, modifying DisjunctionMaxScorer
>    so that it doesn't use a min heap for the subscorers. We'll try several
>    strategies to collect documents from the child subscorers, like round-robin
>    or collecting the narrower subscorers first and then go broader until the
>    upstream collector stops the collection. This looks like the most
>    interesting option.
>
>
> Enumerating all combinations is NPcomplete task I believe. But you have a
>> good heuristics:
>> * zero docFreq means that you can drop this term off or pass it through
>> spell correction
>> * if you have a instant suggest like app and has zero result for some
>> phrase, maybe dropping the last word gives you the phrase which had some
>> results before, and present in cache.
>> * otherwise excluding less frequent term from conjunction probably gives
>> non-zero results
>>
>
> This is not a problem in practice. We're using a bunch of heuristics in
> our QueryParser (including a lot of info extracted from the TermsEnum,
> stopword lists, etc ...) to severely cut the space.
>
> Thanks
> Carlos
>
>
>
>>
>> Regards
>>
>>
>> On Thu, Mar 15, 2012 at 12:01 AM, Carlos Gonzalez-Cadenas <
>> cgc@experienceon.com> wrote:
>>
>>> Hello all,
>>>
>>> We have a SOLR index filled with user queries and we want to retrieve the
>>> ones that are more similar to a given query entered by an end-user. It is
>>> kind of a "related queries" system.
>>>
>>> The index is pretty big and we're using early-termination of queries
>>> (with
>>> the index sorted so that the "more popular" queries have lower docids and
>>> therefore the termination yields higher-quality results)
>>>
>>> Clearly, when the user enters a user-level query into the search box,
>>> i.e.
>>> "cheap hotels barcelona offers", we don't know whether there exists a
>>> document (query) in the index that contains these four words or not.
>>>  Therefore, when we're building the SOLR query, the first intuition would
>>> be to do a query like this "cheap OR hotels OR barcelona OR offers".
>>>
>>> If all the documents in the index where evaluated, the results of this
>>> query would be good. For example, if there is no query in the index with
>>> these four words but there's a query in the index with the text "cheap
>>> hotels barcelona", it will probably be one of the top results, which is
>>> precisely what we want.
>>>
>>> The problem is that we're doing early termination and therefore this
>>> query
>>> will exhaust very fast the top-K result limit (our custom collector
>>> limits
>>> on the number of evaluated documents), given that queries like "hotels in
>>> madrid" or "hotels in NYC" will match the OR expression described above
>>> (because they all match "hotels").
>>>
>>> Our next step was to think in a DisjunctionMaxQuery, trying to write a
>>> query like this:
>>>
>>> DisjunctionMaxQuery:
>>>  1) +cheap +hotels +barcelona +offers
>>>  2) +cheap +hotels +barcelona
>>>  3) +cheap +hotels
>>>  4) +hotels
>>>
>>> We were thinking that perhaps the sub-queries within the
>>> DisjunctionMaxQuery were going to get evaluated in "parallel" given that
>>> they're separated queries, but in fact from a runtime perspective it does
>>> behave in a similar way than the OR query that we described above.
>>>
>>> Our desired behavior is to try match documents with each subquery within
>>> the DisjunctionMaxQuery (up to a per-subquery limit that we put) and then
>>> score them and return them all together (therefore we don't want all the
>>> matches being done by a single sub-query, like it's happening now).
>>>
>>> Clearly, we could create a script external to SOLR that just runs the
>>> several sub-queries as standalone queries and then joins all the results
>>> together, but before going for this we'd like to know if you have any
>>> ideas
>>> on how to solve this problem within SOLR. We do have our own QParser, and
>>> therefore we'd be able to implement any arbitrary query construction that
>>> you can come up with, or even create a new Query type if it's needed.
>>>
>>> Thanks a lot for your help,
>>> Carlos
>>>
>>>
>>> Carlos Gonzalez-Cadenas
>>> CEO, ExperienceOn - New generation search
>>> http://www.experienceon.com
>>>
>>> Mobile: +34 652 911 201
>>> Skype: carlosgonzalezcadenas
>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Lucid Certified
>> Apache Lucene/Solr Developer
>> Grid Dynamics
>>
>> <http://www.griddynamics.com>
>>  <mkhludnev@griddynamics.com>
>>
>>
>


-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message