lucene-solr-user mailing list archives

From Per Steffensen <st...@designware.dk>
Subject Re: How does query on <few-hits> AND <many-hits> work
Date Mon, 26 May 2014 11:49:16 GMT
I do not know whether this is a special case. I guess an AND-query where 
one side hits 500-1000 documents and the other side hits billions is a 
special case. But this way of carrying out the query might also be an 
optimization in less uneven cases.
It does not require that the "lots of hits"-part of the query is a 
range-query, and it does not necessarily require that the field used in 
this part is DocValue (you can go fetch the values from the "slow" store). 
But I guess it has to be a very uneven case for this approach to be 
faster on a non-DocValue field.

I think this can be generalized. I think of it as something similar to 
being able to "hint" relational databases not to use a specific index. 
I do not know that much about Solr/Lucene query-syntax, but I believe 
"filter-queries" (fq) are queries that will be AND'ed onto the 
real query (q). In order not to have to change the query-syntax too 
much (adding hints or something), I guess a first step for a feature 
doing what I am doing here could be to introduce something similar to 
"filter-queries": queries that will be carried out on the result of (q 
+ fqs), but by looking at the values of the documents in that result 
instead of intersecting with doc-sets found from the index. Let's call 
them "post-query-value-filter"s (yes, we can definitely come up with a 
better/shorter name).

1) q=no_dlng_doc_ind_sto:(<NO>) AND timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
2) q=no_dlng_doc_ind_sto:(<NO>),fq=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
3) q=no_dlng_doc_ind_sto:(<NO>),post-query-value-filter=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])

1) and 2) both use the index on both no_dlng_doc_ind_sto and 
timestamp_dlng_doc_ind_sto. 3) uses only the index on no_dlng_doc_ind_sto 
and does the time-interval filter part by fetching values (using 
DocValue if possible) for timestamp_dlng_doc_ind_sto for each of the 
docs found through the no_dlng_doc_ind_sto-index, to see whether that 
doc should really be included.
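
To illustrate the difference between 1)/2) and 3), here is a minimal 
Python sketch. It is a toy model, not Solr or Lucene code: the dict 
"postings" stands in for the index on no_dlng_doc_ind_sto, the list 
"timestamps" stands in for per-document values (DocValue-style, indexed 
by doc-id) of timestamp_dlng_doc_ind_sto, and both function names are 
made up for this example.

```python
from typing import Dict, List


def index_intersection(postings: Dict[int, List[int]], timestamps: List[int],
                       no: int, time_start: int, time_end: int) -> List[int]:
    """Toy model of 1)/2): build a doc-set for each side, then intersect."""
    # Stand-in for walking the timestamp index: this touches every doc in
    # the range, which is the expensive part when the range hits billions.
    range_docs = {doc for doc, ts in enumerate(timestamps)
                  if time_start <= ts <= time_end}
    return [doc for doc in postings.get(no, []) if doc in range_docs]


def post_query_value_filter(postings: Dict[int, List[int]], timestamps: List[int],
                            no: int, time_start: int, time_end: int) -> List[int]:
    """Toy model of 3): use only the selective index, then check values."""
    # Only the few docs matching <NO> are visited; for each one we look up
    # its timestamp value by doc-id and keep it if it is inside the range
    # (inclusive at both ends, like [<TIME_START> TO <TIME_END>]).
    candidates = postings.get(no, [])
    return [doc for doc in candidates
            if time_start <= timestamps[doc] <= time_end]


# Hypothetical data: six docs, doc-id is the position in the list.
timestamps = [100, 205, 310, 415, 520, 625]
postings = {7: [1, 3, 4], 9: [0, 2, 5]}

hits = post_query_value_filter(postings, timestamps, 7, 200, 450)
```

Both functions return the same doc-ids; the difference is only in how 
many documents have to be touched to get there. In the sketch, the work 
done by post_query_value_filter grows with the number of hits on the 
selective side, not with the number of hits in the time range.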

There are some things I did not mention initially, such as actually 
wanting to do a facet search. Here is the full story: 
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html

Regards, Per Steffensen

On 23/05/14 17:37, Toke Eskildsen wrote:
> Per Steffensen [steff@designware.dk] wrote:
>> * It IS more efficient to just use the index for the
>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>> the docs that does not match the "timestamp_dlng_doc_ind_sto"-part of
>> the query.
> Thank you for the follow up. It sounds rather special-case though, with requirement of
> DocValues for the range-field. Do you think this can be generalized?
>
> - Toke Eskildsen
>

