lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "FilterQueryGuidance" by MichaelLudwig
Date Fri, 12 Jun 2009 13:36:42 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by MichaelLudwig:
http://wiki.apache.org/solr/FilterQueryGuidance

The comment on the change is:
Guidance on using filters for increased efficiency

New page:
= Guidance on Using Filters for Increased Efficiency =

(The following is mainly reused content from the thread starting at
[http://markmail.org/message/ekmpwluxqfvbnhvx Re: fq vs. q - Michael Ludwig]
on the org.apache.lucene.solr-user mailing list.)

== What goes in the `q` parameter and what goes in the `fq` parameter? ==

How do I decide, when writing a query,
which criteria go in the `q` parameter and which go in the `fq` parameter
to achieve optimal performance?
Is there some kind of rule of thumb to help me decide
how to split things up
when querying against one or more fields?

== Understanding Solr's caching system ==

First, some background is necessary to understand SolrCaching,
notably the `queryResultCache` and the `filterCache`.
You should be familiar with the different caches Solr has
in order to make informed decisions on filter usage.

You will then know that a filter query
results in a filter (implemented as a bit vector),
which is cached.
The more often a filter recurs, the more useful its cache entry becomes.
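
As a rough illustration of why repetition matters, here is a toy model of a filter cache. It is only a sketch: the `DOCS` data, the `ToyFilterCache` class and its `computations` counter are invented for this example and are not Solr's implementation. A filter is evaluated once, stored as a bit vector (a Python integer here), and reused on every later request.

```python
# Toy index: list position is the internal document ID (all data here is made up).
DOCS = [
    {"bla": "eins"},
    {"bla": "zwei"},
    {"bla": "eins"},
    {"bla": "drei"},
]


class ToyFilterCache:
    """Maps a filter string such as 'bla:eins' to a bit vector stored as an int."""

    def __init__(self):
        self.entries = {}
        self.computations = 0  # how often a filter had to be evaluated from scratch

    def get(self, fq):
        if fq not in self.entries:
            field, value = fq.split(":", 1)
            bits = 0
            for doc_id, doc in enumerate(DOCS):
                if doc.get(field) == value:
                    bits |= 1 << doc_id  # bit position is the document ID
            self.entries[fq] = bits
            self.computations += 1
        return self.entries[fq]


cache = ToyFilterCache()
for _ in range(1000):
    cache.get("bla:eins")

print(bin(cache.get("bla:eins")))  # documents 0 and 2 match
print(cache.computations)          # evaluated once, served from the cache afterwards
```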

Now, what is a filter query?
It is simply a part of a query that is factored out for special treatment.
This is achieved in Solr by specifying it using the `fq` (filter query) parameter
instead of the `q` (main query) parameter.
The same set of matching documents could be obtained by leaving that query part in the main query
(filter queries do not contribute to relevancy scores, though).
The difference will be in query efficiency.
That's because the result of a filter query is cached
and then used to filter a primary query result using set intersection.
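
To make the difference between the two parameters concrete, the sketch below builds two request URLs for the same criteria. The host, port and path in `SOLR_SELECT_URL`, the `title` field, and the helper `build_select_url` are assumptions made for illustration; adjust them to your installation.

```python
from urllib.parse import urlencode

# Hypothetical base URL; host, port and path depend on your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(main_query, filter_queries=()):
    """Return a select URL with the main query in q and each filter in its own fq."""
    params = [("q", main_query)]
    params.extend(("fq", fq) for fq in filter_queries)
    return SOLR_SELECT_URL + "?" + urlencode(params)


# All criteria in the main query.
combined = build_select_url("title:solr AND bla:eins")

# The recurring criterion factored out into a filter query.
factored = build_select_url("title:solr", ["bla:eins"])

print(combined)
print(factored)
```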

So how can we put this to practical use?
Well, we have to decompose our queries
and take note of the frequency of each query part.
If we know how often certain query parts arise,
or at least have the means to collect that data,
we know what might be candidates for filtering.

== Thinking about our queries analytically ==

Now how do we know what query parts recur frequently?

Well, we know the application we're writing,
so we either know the frequency of a given query part
based on the usage our application makes of Solr
and on the restrictions it imposes on the user by, say, using DisMaxRequestHandler;
or - if we give the user fine-grained control over the query language -
we may somehow collect and analyze the actual queries
in order to empirically determine
actual search engine usage and query part frequency
and optimize accordingly.

Anyway, we need to analyze, to decompose the queries we want our system to handle.
We'll then know the query parts and the frequency of the various combinations,
and we'll then see what are good filter query candidates.
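
One possible way to collect such frequencies is sketched below. It assumes that the raw query strings are available from some request log (the `LOGGED_REQUESTS` list is made-up data) and it naively splits the `q` value on ` AND `, which will not cope with nested or quoted expressions. A real analysis would need a proper query parser.

```python
from collections import Counter
from urllib.parse import parse_qsl

# Hypothetical logged query strings; in practice these would come from your request logs.
LOGGED_REQUESTS = [
    "q=title:solr+AND+bla:eins",
    "q=title:lucene+AND+bla:eins",
    "q=author:smith+AND+bla:zwei",
    "q=title:solr+AND+bla:zwei",
    "q=title:cache+AND+bla:eins",
]


def clause_frequencies(requests):
    """Count how often each top-level AND clause occurs across the logged q values."""
    counts = Counter()
    for request in requests:
        params = dict(parse_qsl(request))
        for clause in params.get("q", "").split(" AND "):
            clause = clause.strip()
            if clause:
                counts[clause] += 1
    return counts


# Most frequent clauses first: these are the filter query candidates.
for clause, count in clause_frequencies(LOGGED_REQUESTS).most_common():
    print(count, clause)
```

In this made-up log the `bla:` clauses recur most often, while the other clauses vary from request to request.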

== An example to illustrate the greater efficiency obtainable by filtering ==

Filtering a given query result `R`
on `bla:eins`, `bla:zwei`, `bla:drei` or `bla:vier`
is very common in my application.
So while I could include this criterion in my main query (`q`)
and hope for the `queryResultCache` to kick in,
this would likely be inefficient
as my primary query, which gave me `R`, likely varies a lot,
resulting in a high number of distinct queries,
with relatively low probability for a given query to occur frequently.
So each of these query result sets would enter the `queryResultCache` as a distinct set,
hence high contention, high eviction rate, poor cache efficiency.

Enter filter queries.
I'm going to factor out those `bla:eins` (etc) filters from my primary query (`q`)
and put them in the filter query (`fq`).
The benefit is twofold:

(1) Solr has a dedicated cache for filters,
the usage of which I control by my usage of the filter query (`fq`).
I can set up things so the usage of the primary query (`q`) is under the user's control
while the usage of the filter query (`fq`) is under my application's control.
I control this cache, I ensure its efficiency,
by allowing only frequently used filters to enter the cache,
and by not allowing so many filters access
that high contention and eviction in the `filterCache` would ensue.

(2) Factoring out the filter query `bla:eins` (etc) from the primary query
also reduces variation in the primary query,
thus making the `queryResultCache` more efficient.

So instead of having, say, 10000 distinct primary queries,
no usage of the `filterCache`,
and poor usage of the `queryResultCache`,
I may have only, say, 3000 distinct primary queries,
four cached filters in the `filterCache` (`bla:eins` etc),
and a somewhat better usage of the `queryResultCache`.

== Stray bits ==

Memory consumption per filter field value is not a great concern here
as the `filterCache` typically (perhaps always) stores only bit vectors,
each bit representing a boolean to signal whether or not the document in question
is a member of the set matching the filter specification.
The document reference is implicit in each bit's position in the vector;
this is by virtue of the fact that Solr's internal document IDs
(which are different from the document IDs the user may assign
via the `<uniqueKey>` in `schema.xml`)
are a sequence of consecutive integers.
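
The sketch below illustrates the bit vector idea with plain Python integers. It is a model, not Solr code: the eight-document index and both bit patterns are invented. It shows that a bit's position stands in for the internal document ID, that filtering a result by set intersection amounts to a bitwise AND, and how little memory one bit per document takes.

```python
def bit_vector_bytes(max_doc):
    """Bytes needed for one bit per document, rounded up to whole bytes."""
    return (max_doc + 7) // 8


def matching_doc_ids(bits):
    """Yield the internal document IDs whose bit is set."""
    doc_id = 0
    while bits:
        if bits & 1:
            yield doc_id
        bits >>= 1
        doc_id += 1


# Hypothetical 8-document index.
main_result = 0b01100110    # main query matched documents 1, 2, 5, 6
cached_filter = 0b10100101  # cached filter matched documents 0, 2, 5, 7

# Set intersection of two bit vectors is a bitwise AND.
filtered = main_result & cached_filter

print(list(matching_doc_ids(filtered)))  # [2, 5]
print(bit_vector_bytes(1_000_000))       # 125000
```

Under this model, a filter over one million documents costs about 125 KB, however many documents it matches.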

If my filter query result comprises
more than 50 % of the entire document collection,
its selectivity is poor.
I might need it despite this fact,
but it might also be worthwhile thinking about how to reframe the requirement,
allowing for more efficient filters.
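
A quick check against the 50 % rule of thumb could look like the following sketch. The bit patterns and the helper `matched_fraction` are hypothetical; in a real system the match count would come from the number of hits a filter query returns.

```python
def matched_fraction(bits, max_doc):
    """Fraction of the collection whose bit is set in the filter's bit vector."""
    return bin(bits).count("1") / max_doc


MAX_DOC = 8                 # hypothetical collection size
broad_filter = 0b11101101   # matches 6 of 8 documents
narrow_filter = 0b00000101  # matches 2 of 8 documents

for name, bits in (("broad", broad_filter), ("narrow", narrow_filter)):
    fraction = matched_fraction(bits, MAX_DOC)
    verdict = "poor selectivity" if fraction > 0.5 else "selective"
    print(name, fraction, verdict)
```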

What varies heavily should probably ''not'' go into the `filterCache`.
For example, a geodata search window (longitude and latitude)
varying over a huge valuespace with each user action
looks like a candidate for the main query (`q`),
whereas some other criterion not subject to such frequent change
and relatively limited in its valuespace
looks like a candidate for the filter query (`fq`).

== Configuring the `filterCache` ==

If I know that only 100 filters are possible,
there is no point raising the `filterCache/@size` above that threshold.
But it may not be harmful either.
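
A back-of-the-envelope estimate suggests why an oversized limit need not hurt. The sketch assumes that every cache entry is a full bit vector of one bit per document and that capacity which is never filled costs nothing. Both are simplifying assumptions, and the figures are invented.

```python
def bit_vector_bytes(max_doc):
    """Bytes needed for one bit per document, rounded up to whole bytes."""
    return (max_doc + 7) // 8


def worst_case_cache_bytes(cache_size, possible_filters, max_doc):
    """Upper bound: the cache never holds more entries than there are distinct filters."""
    entries = min(cache_size, possible_filters)
    return entries * bit_vector_bytes(max_doc)


# Hypothetical figures: 100 possible filters, one million documents.
at_threshold = worst_case_cache_bytes(100, 100, 1_000_000)
above_threshold = worst_case_cache_bytes(512, 100, 1_000_000)

print(at_threshold, above_threshold)
```

Under these assumptions both configurations top out at the same 12.5 MB, because only 100 entries can ever exist.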

Given the following three filtering scenarios of
(a) `x:bla`, (b) `y:blub`, and (c) `x:bla AND y:blub`,
will I end up with two or three distinct filters?
In other words, may filters be composites or are they decomposed
as far as their number (relevant for `filterCache/@size`) is concerned?
In this example, (a), (b) and (c) are three distinct filters.
If, however, (c) were specified using two distinct `fq` parameters `x:bla` and `y:blub`,
I'd end up with only two distinct filters for (a), (b) and (c).
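
The counting argument can be sketched as follows. Each request is modelled as the list of its `fq` values, and the raw filter string serves as a stand-in for whatever key the cache really uses; that stand-in is an assumption of this example.

```python
def distinct_filter_entries(requests):
    """Each fq value is cached on its own, so collect the distinct fq strings."""
    entries = set()
    for fq_params in requests:
        entries.update(fq_params)
    return entries


# Scenario (c) written as a single composite fq.
composite = [["x:bla"], ["y:blub"], ["x:bla AND y:blub"]]

# Scenario (c) written as two separate fq parameters.
separate = [["x:bla"], ["y:blub"], ["x:bla", "y:blub"]]

print(len(distinct_filter_entries(composite)))  # 3
print(len(distinct_filter_entries(separate)))   # 2
```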

== What happens when the `filterCache` is full? ==

What happens when the `filterCache` is full?
Is there any accounting of which cache entries are getting the most or most recent hits?
A good question, which remains to be answered.
