lucene-java-user mailing list archives

From Li Li <>
Subject Re: Re: Many keywords problem
Date Tue, 08 May 2012 13:44:14 GMT
A disjunction query is much slower than a conjunction query. That's why
many search engines use conjunction as the default.
By the way, you say you have 5,000,000 documents. How many documents
match your query? Do you need the results sorted by relevance score, or do
you just want the matches and don't care about the order?
If you don't care about sorting, you could try a filter:
Query allDocsQuery = parser.parse("*:*");
TermsFilter cityFilter = new TermsFilter();
for (String term : terms) {
       // one filter term per city name
       cityFilter.addTerm(new Term("city", term));
}

I am not sure this method is faster than a boolean OR query.
In theory, BooleanScorer is a TAAT (term-at-a-time) method: it traverses
each term within a 2k window. BooleanScorer2 is a DAAT (document-at-a-time)
algorithm. BooleanScorer is faster than BooleanScorer2, but it can't
support required clauses or exclusive clauses, and the term count must be
less than 32 (because it uses a 32-bit integer to remember which terms hit).
TermsFilter is similar to BooleanScorer: it traverses all the terms and uses
a bitset to mark the documents that hit. If the number of matching documents
is very large, it may be faster than BooleanScorer2.
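
To make the term-at-a-time idea concrete, here is a rough sketch in plain Java. It is not Lucene code and not how TermsFilter is actually written; the posting lists are just hypothetical int arrays of doc IDs, and the class and method names are made up for illustration. It ORs each group of terms into a bitset one term at a time, then ANDs the two bitsets, which is the shape of the "(a1 or a2 ...) and (b1 or b2 ...)" query from this thread.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Term-at-a-time OR evaluation over in-memory posting lists (illustrative only, not Lucene code). */
public class TaatBitsetSketch {

    /** ORs every posting list of one group into a bitset, one term at a time. */
    public static BitSet union(List<int[]> postingLists, int maxDoc) {
        BitSet hits = new BitSet(maxDoc);
        for (int[] postings : postingLists) {   // outer loop: terms
            for (int docId : postings) {        // inner loop: docs containing that term
                hits.set(docId);
            }
        }
        return hits;
    }

    /** Evaluates (a1 OR a2 ...) AND (b1 OR b2 ...) by intersecting the two group bitsets. */
    public static BitSet orGroupsAnded(List<int[]> groupA, List<int[]> groupB, int maxDoc) {
        BitSet result = union(groupA, maxDoc);
        result.and(union(groupB, maxDoc));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings: the doc IDs that contain each term.
        List<int[]> cities = Arrays.asList(new int[] {0, 3, 7}, new int[] {2, 3});
        List<int[]> events = Arrays.asList(new int[] {3, 5}, new int[] {7, 8});
        System.out.println(orGroupsAnded(cities, events, 10)); // prints {3, 7}
    }
}
```

The cost here grows with the total length of the posting lists and there is no scoring at all, which is why this only fits the case where you don't need the results sorted by relevance.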

On Tue, May 8, 2012 at 6:54 PM, 齐保元 <> wrote:
> Thanks for your reply, first of all. The reason for so many OR clauses is to monitor terms.
> One scenario: if I want to know the cities of a province and the events that happen there,
> I may build a query like "(California or NewYork or SanFrancisco.... or SomePlace) and
> (Pollution or Criminal ... or Alcohol)". So the query gets long... I hope I have described
> the question.
> At 2012-05-08 18:44:13,"Li Li" <> wrote:
>>A disjunction (OR) query of so many terms is indeed slow.
>>Can you describe your real problem? Why do you need the disjunction
>>results of so many terms?
>>On Sun, May 6, 2012 at 9:57 PM, <> wrote:
>>> Hi,
>>>       I ran into a problem searching for many keywords in about 5,000,000
>>> documents. For example, the query may look like "(a1 or a2 or a3 ....a200) and
>>> (b1 or b2 or b3 or b4 ..... b400)". I found that it takes a very long time
>>> (40 seconds) to get the answer on only one field (the title field), and the JVM
>>> throws an OutOfMemoryError with more fields (title field plus content field).
>>> Any suggestions or good ideas for solving this problem? Thanks in advance.
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail: