Date: Tue, 8 May 2012 21:47:16 +0800
Subject: Re: Re: Many keywords problem
From: Li Li
To: java-user@lucene.apache.org

But this only gives you (term1 or term2 or term3 ...). You can't implement
(term1 or term2 ...) and (term3 or term4) with this method. Maybe you should
write your own Scorer to deal with this kind of query.

On Tue, May 8, 2012 at 9:44 PM, Li Li wrote:
> A disjunction query is much slower than a conjunction query. That's why
> many search engines use conjunction as the default.
> By the way, you say you have 5,000,000 documents. How many documents
> match your query? Do you need to sort by relevance score, or do you just
> want to match and don't care about the order?
> If you don't care about the order, you may try to use a filter,
> e.g.
> Query allDocsQuery = parser.parse("*:*");
> TermsFilter cityFilter = new TermsFilter();
> for (String term : terms) {
>       cityFilter.addTerm(new Term("city", term));
> }
> searcher.search(allDocsQuery, cityFilter);
>
> I am not sure this method is faster than a boolean OR query.
> In theory, BooleanScorer is a TAAT (term-at-a-time) method (it traverses
> each term in a 2k window). BooleanScorer2 is a DAAT (document-at-a-time)
> algorithm. BooleanScorer is faster than BooleanScorer2, but it can't
> support required queries and exclusive queries, and the term count must
> be less than 32 (because it uses a 32-bit integer to remember which
> term hit).
> TermsFilter is similar to BooleanScorer: it traverses all terms and uses
> a bitset to mark the documents that hit. If your number of matched
> documents is very large, it may be faster than BooleanScorer2.
>
>
> On Tue, May 8, 2012 at 6:54 PM, 齐保元 wrote:
>> Thanks for your reply, firstly. The reason for so many OR clauses is to monitor terms. One scenario: if I want to know about the cities of a province and the events that happen there, I may build a query like "(California or NewYork or SanFrancisco ... or SomePlace) and (Pollution or Criminal ... or Alcohol)". So the long query happens. I hope I have described the question clearly.
>> At 2012-05-08 18:44:13, "Li Li" wrote:
>>>A disjunction (or) query of so many terms is indeed slow.
>>>Can you describe your real problem? Why do you need the disjunction
>>>results of so many terms?
>>>
>>>
>>>
>>>On Sun, May 6, 2012 at 9:57 PM, qibaoyuan@126.com wrote:
>>>> Hi,
>>>>       I met a problem about how to search many keywords in about 5,000,000 documents. For example, the query may be like "(a1 or a2 or a3 .... a200) and (b1 or b2 or b3 or b4 ..... b400)". I found it takes a very long time (40 seconds) to get the answer in only one field (the title field), and the JVM throws an OutOfMemoryError with more fields (title field plus content field). Any suggestions or good ideas to solve this problem? Thanks in advance.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
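
The grouped query discussed in this thread, (a1 or a2 ...) and (b1 or b2 ...), can be sketched with plain bitsets in the spirit of the TermsFilter description above: OR the postings of each group into one bitset, then AND the two bitsets. This is only an illustration in plain Java, not Lucene code. The class GroupedTermMatch, its methods, and the in-memory postings map are hypothetical stand-ins for a real index, and no relevance scoring is done.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: "postings" maps a term to the ids of the
// documents that contain it, standing in for a real inverted index.
class GroupedTermMatch {

    // Term-at-a-time OR: visit each term's postings and mark its documents.
    static BitSet anyOf(Map<String, int[]> postings, List<String> terms, int maxDoc) {
        BitSet hits = new BitSet(maxDoc);
        for (String term : terms) {
            int[] docs = postings.get(term);
            if (docs == null) {
                continue; // term does not occur in the index
            }
            for (int doc : docs) {
                hits.set(doc);
            }
        }
        return hits;
    }

    // (groupA[0] OR groupA[1] ...) AND (groupB[0] OR groupB[1] ...)
    static BitSet matchBoth(Map<String, int[]> postings,
                            List<String> groupA, List<String> groupB, int maxDoc) {
        BitSet result = anyOf(postings, groupA, maxDoc);
        result.and(anyOf(postings, groupB, maxDoc)); // keep docs present in both groups
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<String, int[]>();
        postings.put("california", new int[] {0, 1, 3});
        postings.put("newyork", new int[] {2, 3});
        postings.put("pollution", new int[] {1, 2});
        postings.put("alcohol", new int[] {3, 5});

        BitSet matches = matchBoth(postings,
                Arrays.asList("california", "newyork"),
                Arrays.asList("pollution", "alcohol"),
                8);
        System.out.println(matches); // prints {1, 2, 3}
    }
}
```

The cost here is one pass over the postings of every term plus a single bitset intersection, and memory is one bit per document per group regardless of how many terms a group has. Whether this beats a scored boolean query on a real index would need measuring.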