lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Using filters to speed up queries
Date Sun, 24 Oct 2010 11:08:10 GMT
The trick is to wrap the TermQuery as ConstantScoreQuery(new
QueryWrapperFilter(new TermQuery(...))), because a TermQuery that is used in
place of a filter should not contribute to the score. This pattern is used
quite often inside Lucene itself, e.g. in MultiTermQuery, so don't worry about
the strange-looking code.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Khash Sajadi [mailto:khash@sajadi.co.uk] 
Sent: Sunday, October 24, 2010 12:50 PM
To: dev@lucene.apache.org
Subject: Re: Using filters to speed up queries

 

Terribly sorry. I meant Mike:

 

> Also note that static index partitioning like this does not result in
> the same scoring as you'd get if each user had their own index, since
> the term stats (IDF) are aggregated across all users.  So for queries
> with more than one term, users can see docs sorted differently, and
> this is actually a known security risk in that users can glean some
> details about the documents they aren't allowed to see due to the
> shared term stats... there is a paper somewhere (Robert?) that delves
> into it.
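
To make the shared-statistics effect concrete, here is a minimal sketch in plain Java. It assumes the classic idf formula 1 + ln(numDocs / (docFreq + 1)) used by Lucene's default similarity at the time; the class name, the two terms and all document frequencies are hypothetical, and only the 3 million documents spread over 500 accounts come from this thread.

```java
/** Hypothetical illustration of how shared term statistics change term weights. */
final class SharedIdfSketch {

    /** Classic idf: 1 + ln(numDocs / (docFreq + 1)), computed in double for simplicity. */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Assumed sizes: one shared index of 3,000,000 docs,
    // versus a per-account index of 6,000 docs (3,000,000 / 500).
    static final long SHARED_DOCS = 3_000_000L;
    static final long ACCOUNT_DOCS = 6_000L;

    /** Term A (made up): common across all accounts, rare inside this account. */
    static double sharedIdfA()  { return idf(30_000, SHARED_DOCS); }
    static double accountIdfA() { return idf(2, ACCOUNT_DOCS); }

    /** Term B (made up): rare across all accounts, common inside this account. */
    static double sharedIdfB()  { return idf(300, SHARED_DOCS); }
    static double accountIdfB() { return idf(600, ACCOUNT_DOCS); }
}
```

With these made-up numbers, term B outweighs term A in the shared index while term A outweighs term B in a per-account index, so a two-term query can rank the same documents differently. The shared weights also depend on documents the user is not allowed to see, which is the information leak described above.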

 

On 24 October 2010 11:46, Uwe Schindler <uwe@thetaphi.de> wrote:

Security risk? I did not say anything about that!

 

 

From: Khash Sajadi [mailto:khash@sajadi.co.uk] 
Sent: Sunday, October 24, 2010 12:34 PM
To: dev@lucene.apache.org
Subject: Re: Using filters to speed up queries

 

Here is what I've found so far:

I have three main parts to combine in a query:

- Account MUST be xxx

- User query

- Date range MUST be in (a,b); it is a NumericField

 

I tried the following combinations (all using a BooleanQuery with the user
query added to it):

 

1. One:

- Add ACCOUNT as a TermQuery

- Add DATE RANGE as a Filter

 

2. Two:

- Add ACCOUNT as a Filter

- Add DATE RANGE as a NumericRangeQuery

 

I tried caching the filters in both scenarios.

I also tried both scenarios with the query wrapped in a ConstantScoreQuery.

 

I got the best result (about 4x faster) by using a cached filter for the
DATE RANGE and leaving the ACCOUNT as a TermQuery.

 

I think I'm happy with this approach. However, the security risk Uwe
mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?

 

As for document distribution, the ACCOUNTS have a similar distribution of
documents.

 

Also, I would still like to try the multi-index approach, but I am not sure
about its memory and file-handle burden (potentially thousands of
readers/writers/searchers open at the same time). I use two processes, one as
indexer and one for search, with the same underlying FSDirectory. For
search, I use writer.getReader().reopen within a SearchManager, as suggested
by Lucene in Action.

 

 

 

On 24 October 2010 10:27, Paul Elschot <paul.elschot@xs4all.nl> wrote:

On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the
> user id as a field on it.
>
> There are about 500 different users with 3 million documents.
>
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
>
> It works but the performance is not great.
>
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
>
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
>
> Thanks
>

When running the query with the filter, the query is run at the same time
as the filter. Initially and after each matching document, the filter is
assumed to be cheaper to execute, so its first or next matching document is
determined first. Then the query and the filter are repeatedly advanced to
each other's next matching document until they are at the same document
(i.e. there is a match), similar to a boolean query with two required clauses.
The Java code doing this is in the private method
IndexSearcher.searchWithFilter().
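
The advancing scheme described above can be sketched in plain Java roughly as follows. This is not the Lucene source: SortedDocs and LeapfrogSketch are hypothetical stand-ins that walk sorted arrays of document ids, and the sketch only illustrates the idea of letting the filter lead and then leapfrogging query and filter until they agree.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a doc-id iterator over a sorted int array. */
final class SortedDocs {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = 0;

    SortedDocs(int... sortedDocIds) {
        this.docs = sortedDocIds;
    }

    /** Returns the first doc id >= target, or NO_MORE_DOCS when exhausted. */
    int advance(int target) {
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

final class LeapfrogSketch {
    /** Collects the doc ids matched by both the query and the filter. */
    static List<Integer> intersect(SortedDocs query, SortedDocs filter) {
        List<Integer> hits = new ArrayList<>();
        // The filter is assumed to be cheaper, so it is positioned first.
        int filterDoc = filter.advance(0);
        int queryDoc = query.advance(filterDoc);
        while (filterDoc != SortedDocs.NO_MORE_DOCS && queryDoc != SortedDocs.NO_MORE_DOCS) {
            if (queryDoc == filterDoc) {
                hits.add(queryDoc);                        // both agree: a match
                filterDoc = filter.advance(filterDoc + 1); // filter leads again
                queryDoc = query.advance(filterDoc);
            } else if (queryDoc > filterDoc) {
                filterDoc = filter.advance(queryDoc);      // filter catches up
            } else {
                queryDoc = query.advance(filterDoc);       // query catches up
            }
        }
        return hits;
    }
}
```

For example, LeapfrogSketch.intersect(new SortedDocs(1, 3, 5, 7, 9, 12), new SortedDocs(3, 4, 7, 12, 20)) visits only the candidates each side proposes and returns the documents 3, 7 and 12.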

It could be that filling the field cache is the performance problem.
How is the performance when this search call with the FieldCacheTermsFilter
is repeated?

Also, for a single indexed term to be used as a filter (the user id in this
case), there may be no need for a cache; a QueryWrapperFilter around the
TermQuery might suffice.

Regards,
Paul Elschot

