Subject: Re: number of hits of pages containing two terms
From: Ian Lea <ian.lea@gmail.com>
To: java-user@lucene.apache.org
Date: Tue, 17 Mar 2009 11:59:46 +0000

OK - thanks for the explanation.  So this is not just a simple search ...

I'll go away and leave you and Michael and the other experts to talk
about clever solutions.

--
Ian.


On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu wrote:
> Ian Lea wrote:
>>
>> Adrian - have you looked any further into why your original two term
>> query was too slow?  My experience is that simple queries are usually
>> extremely fast.
>
> Let me first point out that it is not "too slow" in absolute terms, it is
> only too slow for my particular need: computing the number of
> co-occurrences between ideally all pairs of non-noise terms (I plan about
> 10k x 10k = 100 million calculations).
>
>> How large is the index?
>
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size
> is 4.4 GB. I have 39 million documents. The particularity is that I cut
> Wikipedia into paragraphs and I consider each paragraph as a Document (not
> one page per Document as usual). Which makes a lot of short documents.
> Each document has a stored id and a non-stored analyzed body:
>
>     doc.add(new Field("id", id, Store.YES, Index.NO));
>     doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>
>> How many occurrences of your first or second
>> terms?
>
> I do have in my index some words that are usually qualified as "stop"
> words. My first two terms are "and": 13M hits and "s": 4M hits. I use the
> SnowballAnalyzer in order to lemmatize words.
>
> My intuition is that the large number of short documents and the fact I am
> interested in the "stop" words do not help performance.
>
> Thank you,
> Adrian.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
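The co-occurrence count Adrian is after boils down to intersecting two sorted
docID posting lists, which is what a conjunctive two-term query does under the
hood. A minimal sketch, independent of the Lucene API (the class name and the
sample docID arrays are illustrative, not from the thread): the cost is linear
in the combined posting lengths, which is why very frequent terms like "and"
(13M postings) dominate the runtime regardless of how the query is phrased.

```java
// Counting documents that contain BOTH terms by intersecting two sorted
// docID posting lists with a two-pointer merge. Lucene stores postings in
// ascending docID order, so this mirrors what a conjunctive query costs:
// O(df1 + df2) comparisons per term pair.
public class CoOccurrenceCount {

    // a and b are ascending, duplicate-free docID lists for two terms
    static int intersectCount(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {        // same document holds both terms
                count++;
                i++;
                j++;
            } else if (a[i] < b[j]) {  // advance the list that is behind
                i++;
            } else {
                j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] and = {1, 2, 5, 7, 9};   // hypothetical docs containing "and"
        int[] s   = {2, 3, 7, 8};      // hypothetical docs containing "s"
        // Docs 2 and 7 contain both terms
        System.out.println(intersectCount(and, s)); // prints 2
    }
}
```

For 10k x 10k term pairs, one such merge per pair is what makes the full
computation expensive; skipping or caching the longest (stop-word) posting
lists is where most of the time would be saved.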