lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Using filters to speed up queries
Date Mon, 25 Oct 2010 10:28:28 GMT
Here's the paper I was thinking of (Robert found this):
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.9682 ...
eg note this sentence from the abstract:

    We show that the first implementation, based on a postprocessing
approach, allows an arbitrary user to obtain information about the
content of files for which he does not have read permission.

Note that one simple way to "gauge" performance of filtering-down-low
would be to open an IndexReader, delete all documents except those
matching your filter (eg the ACCOUNT filter), then run your searches
against that IndexReader without the ACCOUNT clause.  If you don't
close that reader then these deletes are never committed.  This is a
simple way to compile a filter into an open IndexReader, but you'd still
then have one reader open per user class, so the risk of too many open
files, etc. still stands.
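
For example (a rough sketch, assuming Lucene 3.x-era APIs, a made-up
"account" field and value, a Directory dir and a Query userQuery already in
hand, and ignoring the write-lock issue discussed below):

  // Open a non-read-only reader so deletions can be buffered on it;
  // as long as this reader is never closed, they are never committed.
  IndexReader reader = IndexReader.open(dir, false);

  // Mark the docs that match the ACCOUNT filter (uses java.util.BitSet
  // and the pre-4.0 IndexReader delete APIs)...
  BitSet keep = new BitSet(reader.maxDoc());
  TermDocs td = reader.termDocs(new Term("account", "xxx"));
  while (td.next()) {
    keep.set(td.doc());
  }
  td.close();

  // ...and delete everything else.
  for (int docID = 0; docID < reader.maxDoc(); docID++) {
    if (!keep.get(docID) && !reader.isDeleted(docID)) {
      reader.deleteDocument(docID);
    }
  }

  // Now run the searches without the ACCOUNT clause.
  IndexSearcher searcher = new IndexSearcher(reader);
  TopDocs hits = searcher.search(userQuery, 10);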

Hmm, though, you could open an initial reader, then clone it and do all
your deletes on that clone for user class 1, then clone it again and do all
the deletes on that clone for user class 2.  This way you only have one set
of open files, but you've "compiled" your filter into the deleted docs for
each reader.

But, in order to do this, you'd have to disable locking (use
NoLockFactory) in your Directory impl, just for these readers, since
you know you'll never commit the readers with pending deletions.  Just
be sure you never close those readers!
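
Roughly (same assumptions as the sketch above; the path is a placeholder and
deleteAllDocsNotMatching is a hypothetical helper doing that delete loop):

  // Directory that never takes the write lock, so these readers can buffer
  // deletes but can never commit them.
  Directory dir = FSDirectory.open(new File("/path/to/index"),
                                   NoLockFactory.getNoLockFactory());

  IndexReader base = IndexReader.open(dir, false);

  // One clone per user class; the clones share the base reader's open files,
  // and each gets its own deleted-docs bits once you delete on it.
  IndexReader class1Reader = base.clone(false);
  deleteAllDocsNotMatching(class1Reader, new Term("account", "class1"));

  IndexReader class2Reader = base.clone(false);
  deleteAllDocsNotMatching(class2Reader, new Term("account", "class2"));

  // Search each clone without the ACCOUNT clause -- and never close or
  // commit these readers.
  IndexSearcher class1Searcher = new IndexSearcher(class1Reader);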

This should give sizable speedups if the filter is non-sparse.

Mike

On Sun, Oct 24, 2010 at 6:34 AM, Khash Sajadi <khash@sajadi.co.uk> wrote:
> Here is what I've found so far:
>
> I have three main clauses to use in a query:
> - ACCOUNT MUST be xxx
> - The user query
> - DATE RANGE MUST be in (a, b); it is a NumericField
> I tried the following combinations (all using a BooleanQuery with the user
> query added to it):
> 1. First:
> - Add ACCOUNT as a TermQuery
> - Add DATE RANGE as a Filter
> 2. Second:
> - Add ACCOUNT as a Filter
> - Add DATE RANGE as a NumericRangeQuery
> I tried caching the filters in both scenarios.
> I also tried both scenarios with the query wrapped in a ConstantScoreQuery.
> I got the best result (about 4x faster) by using a cached filter for the
> DATE RANGE and leaving the ACCOUNT as a TermQuery.
> I think I'm happy with this approach. However, the security risk Uwe
> mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?
> As for document distribution, the ACCOUNTS have a similar distribution of
> documents.
> Also, I would still like to try the multi-index approach, but I'm not sure
> about the memory and file-handle burden of it (having potentially thousands
> of readers/writers/searchers open at the same time). I use two processes,
> one as indexer and one for search, with the same underlying FSDirectory. As
> for search, I use writer.getReader().reopen within a SearchManager as
> suggested by Lucene in Action.
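
(For illustration, the winning combination described above looks roughly
like this -- assuming 3.x APIs; the field names, values, and variables such
as searcher, userQuery, accountId, startMillis and endMillis are
placeholders:)

  // Date-range filter, wrapped so its doc id set is cached and reused.
  Filter dateFilter = new CachingWrapperFilter(
      NumericRangeFilter.newLongRange("date", startMillis, endMillis,
                                      true, true));

  // ACCOUNT stays a required clause of the query itself.
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("account", accountId)),
            BooleanClause.Occur.MUST);
  query.add(userQuery, BooleanClause.Occur.MUST);

  TopDocs hits = searcher.search(query, dateFilter, 10);
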
>
>
>
> On 24 October 2010 10:27, Paul Elschot <paul.elschot@xs4all.nl> wrote:
>>
>> On Sunday, 24 October 2010 at 00:18:48, Khash Sajadi wrote:
>> > My index contains documents for different users. Each document has the
>> > user id as a field on it.
>> >
>> > There are about 500 different users with 3 million documents.
>> >
>> > Currently I'm calling Search with the query (parsed from the user input)
>> > and a FieldCacheTermsFilter for the user id.
>> >
>> > It works but the performance is not great.
>> >
>> > Ideally, I would like to perform the search only on the documents that
>> > are relevant; this should make it much faster. However, it seems
>> > Search(Query, Filter) runs the query first and then applies the filter.
>> >
>> > Is there a way to improve this? (i.e. run the query only on a subset of
>> > documents)
>> >
>> > Thanks
>> >
>>
>> When running the query with the filter, the query is run at the same time
>> as the filter. Initially, and after each matching document, the filter is
>> assumed to be cheaper to execute, and its first or next matching document
>> is determined. Then the query and the filter are repeatedly advanced to
>> each other's next matching document until they are at the same document
>> (i.e. there is a match), similar to a boolean query with two required
>> clauses. The Java code doing this is in the private method
>> IndexSearcher.searchWithFilter().
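
(A simplified sketch of that leapfrogging -- not the actual searchWithFilter
code; here scorer is assumed to come from the query's Weight, filterIter
from the filter's DocIdSet, and collector is the hit collector:)

  int filterDoc = filterIter.nextDoc();
  int scorerDoc = scorer.advance(filterDoc);
  while (true) {
    if (scorerDoc == filterDoc) {
      if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
        break;                                   // both sides exhausted
      }
      collector.collect(scorerDoc);              // same doc: a real match
      filterDoc = filterIter.nextDoc();          // advance the filter first...
      scorerDoc = scorer.advance(filterDoc);     // ...then pull the query up to it
    } else if (scorerDoc > filterDoc) {
      filterDoc = filterIter.advance(scorerDoc); // filter catches up to the query
    } else {
      scorerDoc = scorer.advance(filterDoc);     // query catches up to the filter
    }
  }
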
>>
>> It could be that filling the field cache is the performance problem.
>> How is the performance when this search call with the
>> FieldCacheTermsFilter is repeated?
>>
>> Also, for a single indexed term to be used as a filter (the user id in
>> this case) there may be no need for a cache; a QueryWrapperFilter around
>> the TermQuery might suffice.
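
(That is, something like the following -- "userid" being whatever the actual
field is called, with searcher, userQuery and userId as placeholders:)

  // No caching involved; the term's postings are walked directly per search.
  Filter userFilter = new QueryWrapperFilter(
      new TermQuery(new Term("userid", userId)));
  TopDocs hits = searcher.search(userQuery, userFilter, 10);
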
>>
>> Regards,
>> Paul Elschot
>>
>
>
