lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Using filters to speed up queries
Date Sun, 24 Oct 2010 10:46:47 GMT
Security risk? I did not say anything about that!

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Khash Sajadi [mailto:khash@sajadi.co.uk] 
Sent: Sunday, October 24, 2010 12:34 PM
To: dev@lucene.apache.org
Subject: Re: Using filters to speed up queries

 

Here is what I've found so far:

I have three main sets to use in a query:

Account MUST be xxx

User query

Date range on the query MUST be in (a,b); the date is a NumericField

 

I tried the following combinations (all using a BooleanQuery with the user
query added to it)

 

1. One:

- Add ACCOUNT as a TermQuery

- Add DATE RANGE as Filter

 

2. Two:

- Add ACCOUNT as Filter

- Add DATE RANGE as NumericRangeQuery

 

I tried caching the filters in both scenarios.

I also tried both scenarios with the query passed as a ConstantScoreQuery.

 

I got the best result (about 4x faster) by using a cached filter for the
DATE RANGE and leaving the ACCOUNT as a TermQuery.

 

I think I'm happy with this approach. However, the security risk Uwe
mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?

 

As for document distribution, the ACCOUNTS have a similar distribution of
documents.

 

Also, I would still like to try the multi-index approach, but I am not sure
about the memory and file handle burden of it (having potentially thousands
of readers/writers/searchers open at the same time). I use two processes,
one as indexer and one for search, with the same underlying FSDirectory. As
for search, I use writer.getReader().reopen within a SearchManager, as
suggested by Lucene in Action.

 

 

 

On 24 October 2010 10:27, Paul Elschot <paul.elschot@xs4all.nl> wrote:

On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the
> user id as a field on it.
>
> There are about 500 different users with 3 million documents.
>
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
>
> It works but the performance is not great.
>
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
>
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
>
> Thanks
>

When running the query with the filter, the query is run at the same time
as the filter. Initially and after each matching document, the filter is
assumed to be cheaper to execute and its first or next matching document is
determined. Then the query and the filter are repeatedly advanced to each
other's next matching document until they are at the same document (i.e.
there is a match), similar to a boolean query with two required clauses.
The java code doing this is in the private method
IndexSearcher.searchWithFilter().
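
The advancing described above can be sketched in plain Java. This is only an
illustration, not the Lucene source: LeapfrogSketch, SortedDocs and intersect
are hypothetical names, and sorted int arrays stand in for the doc id
iterators of the query and the filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: plain sorted arrays stand in for Lucene's iterators.
class LeapfrogSketch {
    // Sentinel returned when an iterator has no more documents.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal stand-in for a doc id iterator over a sorted array. */
    static final class SortedDocs {
        private final int[] docs;
        private int pos = -1;

        SortedDocs(int[] docs) {
            this.docs = docs;
        }

        /** Moves to the next doc id, or NO_MORE_DOCS when exhausted. */
        int nextDoc() {
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Moves past the current doc to the first doc id >= target. */
        int advance(int target) {
            int doc;
            do {
                doc = nextDoc();
            } while (doc < target);
            return doc;
        }
    }

    /** Returns the doc ids matched by both the query and the filter. */
    static List<Integer> intersect(int[] queryDocs, int[] filterDocs) {
        SortedDocs query = new SortedDocs(queryDocs);
        SortedDocs filter = new SortedDocs(filterDocs);
        List<Integer> hits = new ArrayList<>();

        // The filter is assumed to be cheaper, so it moves first.
        int filterDoc = filter.nextDoc();
        int queryDoc = query.advance(filterDoc);

        while (filterDoc != NO_MORE_DOCS && queryDoc != NO_MORE_DOCS) {
            if (queryDoc == filterDoc) {
                // Both are on the same document: a match.
                hits.add(queryDoc);
                filterDoc = filter.nextDoc();
                queryDoc = query.advance(filterDoc);
            } else if (queryDoc > filterDoc) {
                // The filter is behind: advance it to the query's document.
                filterDoc = filter.advance(queryDoc);
            } else {
                // The query is behind: advance it to the filter's document.
                queryDoc = query.advance(filterDoc);
            }
        }
        return hits;
    }
}
```

Calling LeapfrogSketch.intersect with query docs {1, 3, 5, 7, 9} and filter
docs {2, 3, 4, 7, 10} yields the matches 3 and 7. Neither side is fully
evaluated before the other; each one skips ahead to the other's position.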

It could be that filling the field cache is the performance problem.
How is the performance when this search call with the FieldCacheTermsFilter
is repeated?

Also, for a single indexed term to be used as a filter (the user id in this
case) there may be no need for a cache; a QueryWrapperFilter around the
TermQuery might suffice.

Regards,
Paul Elschot



 

