lucene-dev mailing list archives

From: Khash Sajadi <kh...@sajadi.co.uk>
Subject: Re: Using filters to speed up queries
Date: Wed, 27 Oct 2010 15:38:26 GMT
Thanks everyone for your help.

In the end, I settled on using a ConstantScoreQuery for the ACCOUNT clause
and a cached filter for the date range. The performance on a 20-million
document index with 500 accounts is awesome!



On 25 October 2010 11:28, Michael McCandless <lucene@mikemccandless.com> wrote:

> Here's the paper I was thinking of (Robert found this):
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.9682 ...
> e.g. note this sentence from the abstract:
>
>    We show that the first implementation, based on a postprocessing
> approach, allows an arbitrary user to obtain information about the
> content of files for which he does not have read permission.
>
> Note that one simple way to "gauge" performance of filtering-down-low
> would be to open an IndexReader, delete all documents except those
> matching your filter (e.g. the ACCOUNT filter), then run your searches
> against that IndexReader without the ACCOUNT clause.  If you don't
> close that reader then these deletes are never committed.  This is a
> simple way to "compile" a filter into an open IndexReader, but you'd
> still have one reader open per user class, so the risks of too many
> open files, etc. still stand.
>
> Hmm, though, you could open an initial reader, then clone it and do
> all your deletes on that clone for user class 1, then clone it again
> and do all the deletes on that clone for user class 2.  This way you
> only have one set of open files, but you've "compiled" your filter
> into the deleted docs for each reader.
>
> But, in order to do this, you'd have to disable locking (use
> NoLockFactory) in your Directory impl, just for these readers, since
> you know you'll never commit the readers with pending deletions.  Just
> be sure you never close those readers!
>
> This should give sizable speedups if the filter is non-sparse.
>
> Mike
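A rough, untested sketch of the clone-and-delete approach described
above, assuming Lucene 3.x APIs; the index path, the "account" field
name and its value are placeholders, and NoLockFactory.getNoLockFactory()
is assumed to be the 3.x accessor:

    import java.io.File;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.*;
    import org.apache.lucene.util.OpenBitSet;

    // No locking: these readers will never commit their deletes.
    Directory dir = FSDirectory.open(new File("/path/to/index"),
                                     NoLockFactory.getNoLockFactory());
    IndexReader base = IndexReader.open(dir, false);   // writable reader

    // One clone per user class, with the ACCOUNT filter "compiled" into
    // uncommitted deletes.
    IndexReader account1 = (IndexReader) base.clone();
    OpenBitSet keep = new OpenBitSet(account1.maxDoc());
    TermDocs td = account1.termDocs(new Term("account", "1"));
    while (td.next()) {
      keep.set(td.doc());
    }
    td.close();
    for (int doc = 0; doc < account1.maxDoc(); doc++) {
      if (!keep.get(doc) && !account1.isDeleted(doc)) {
        account1.deleteDocument(doc);   // pending delete, never committed
      }
    }
    // Search with new IndexSearcher(account1); never close or commit it.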
>
> On Sun, Oct 24, 2010 at 6:34 AM, Khash Sajadi <khash@sajadi.co.uk> wrote:
> > Here is what I've found so far:
> >
> > I have three main clauses to use in a query:
> > - Account MUST be xxx
> > - User query
> > - DateRange MUST be in (a,b); it is a NumericField
> > I tried the following combinations (all using a BooleanQuery with the
> > user query added to it):
> > 1. One:
> >    - Add ACCOUNT as a TermQuery
> >    - Add DATE RANGE as a Filter
> > 2. Two:
> >    - Add ACCOUNT as a Filter
> >    - Add DATE RANGE as a NumericRangeQuery
> > I tried caching the filters in both scenarios. I also tried both
> > scenarios passing the query as a ConstantScoreQuery.
> > I got the best result (about 4x faster) by using a cached filter for
> > the DATE RANGE and leaving the ACCOUNT as a TermQuery.
> > I think I'm happy with this approach. However, the security risk Uwe
> > mentioned about using ACCOUNT as a Query makes me nervous. Any
> > suggestions?
> > As for document distribution, the ACCOUNTS have a similar distribution
> > of documents.
> > Also, I would still like to try the multi-index approach, but I'm not
> > sure about the memory and file handle burden of it (having potentially
> > thousands of readers/writers/searchers open at the same time). I use
> > two processes, one as the indexer and one for search, with the same
> > underlying FSDirectory. For search, I use writer.getReader().reopen()
> > within a SearchManager as suggested by Lucene in Action.
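For clarity, a short sketch of the combination reported as fastest above
(ACCOUNT as a plain TermQuery, DATE RANGE as a cached filter); the field
names and variables are illustrative assumptions, not from the thread:

    BooleanQuery query = new BooleanQuery();
    query.add(userQuery, BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("account", accountId)),
              BooleanClause.Occur.MUST);          // ACCOUNT scored as a term

    Filter cachedDateFilter = new CachingWrapperFilter(
        NumericRangeFilter.newLongRange("date", fromMillis, toMillis, true, true));

    TopDocs hits = searcher.search(query, cachedDateFilter, 20);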
> >
> >
> >
> > On 24 October 2010 10:27, Paul Elschot <paul.elschot@xs4all.nl> wrote:
> >>
> >> On Sunday 24 October 2010 at 00:18:48, Khash Sajadi wrote:
> >> > My index contains documents for different users. Each document has
> >> > the user id as a field on it.
> >> >
> >> > There are about 500 different users with 3 million documents.
> >> >
> >> > Currently I'm calling Search with the query (parsed from user
> >> > input) and a FieldCacheTermsFilter for the user id.
> >> >
> >> > It works, but the performance is not great.
> >> >
> >> > Ideally, I would like to perform the search only on the documents
> >> > that are relevant; this should make it much faster. However, it
> >> > seems Search(Query, Filter) runs the query first and then applies
> >> > the filter.
> >> >
> >> > Is there a way to improve this? (i.e. run the query only on a
> >> > subset of documents)
> >> >
> >> > Thanks
> >> >
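A minimal sketch of the setup described in the question above, assuming
a field named "userid" and variables userId, userQuery and searcher (none
of these names are given in the thread):

    import org.apache.lucene.search.*;

    // Query parsed from the user, filtered down to one user's documents.
    Filter userFilter = new FieldCacheTermsFilter("userid", userId);
    TopDocs hits = searcher.search(userQuery, userFilter, 20);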
> >>
> >> When running the query with the filter, the query is run at the
> >> same time as the filter. Initially, and after each matching
> >> document, the filter is assumed to be cheaper to execute and its
> >> first or next matching document is determined. Then the query and
> >> the filter are repeatedly advanced to each other's next matching
> >> document until they are at the same document (i.e. there is a
> >> match), similar to a boolean query with two required clauses.
> >> The Java code doing this is in the private method
> >> IndexSearcher.searchWithFilter().
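A hedged illustration of that leapfrog pattern (a sketch, not the actual
IndexSearcher.searchWithFilter source), using the DocIdSetIterator
contract (nextDoc/advance/NO_MORE_DOCS); scorer, filterIter and collector
are assumed to be a Scorer, a DocIdSetIterator and a Collector:

    int filterDoc = filterIter.nextDoc();
    int scorerDoc = scorer.advance(filterDoc);
    while (true) {
      if (scorerDoc == filterDoc) {
        if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
          break;                                   // both exhausted
        }
        collector.collect(scorerDoc);              // both match: a hit
        filterDoc = filterIter.nextDoc();
        scorerDoc = scorer.advance(filterDoc);
      } else if (scorerDoc > filterDoc) {
        // the query is ahead; pull the filter up to the query's doc
        filterDoc = filterIter.advance(scorerDoc);
      } else {
        // the filter is ahead; pull the query up to the filter's doc
        scorerDoc = scorer.advance(filterDoc);
      }
    }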
> >>
> >> It could be that filling the field cache is the performance problem.
> >> How is the performance when this search call with the
> >> FieldCacheTermsFilter is repeated?
> >>
> >> Also, for a single indexed term used as a filter (the user id in this
> >> case) there may be no need for a cache; a QueryWrapperFilter around
> >> the TermQuery might suffice.
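A small sketch of that uncached variant (the "userid" field name and the
surrounding variables are placeholders, as above):

    Filter userFilter =
        new QueryWrapperFilter(new TermQuery(new Term("userid", userId)));
    TopDocs hits = searcher.search(userQuery, userFilter, 20);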
> >>
> >> Regards,
> >> Paul Elschot
> >>
