From: Adrian Dimulescu
Date: Tue, 17 Mar 2009 12:35:19 +0100
To: java-user@lucene.apache.org
Subject: Re: number of hits of pages containing two terms
Message-ID: <49BF8AF7.7000809@gmail.com>

Ian Lea wrote:
> Adrian - have you looked any further into why your original two term
> query was too slow?  My experience is that simple queries are usually
> extremely fast.

Let me first point out that it is not "too slow" in absolute terms. It is only too slow for my particular need, which is to compute the number of co-occurrences between, ideally, all pairs of non-noise terms (I plan on about 10k x 10k = 100 million calculations).

> How large is the index?

I indexed Wikipedia (the 8 GB XML dump you can download). The index size is 4.4 GB, and I have 39 million documents. The particularity is that I cut Wikipedia into paragraphs and treat each paragraph as a Document (not one page per Document, as is usual), which makes for a lot of short documents. Each document has a stored id and a non-stored, analyzed body:

    doc.add(new Field("id", id, Store.YES, Index.NO));
    doc.add(new Field("text", p, Store.NO, Index.ANALYZED));

> How many occurrences of your first or second
> terms?

My index does contain some words that are usually classified as "stop" words. My first two terms are "and" (13M hits) and "s" (4M hits). I use the SnowballAnalyzer to stem words.

My intuition is that the large number of short documents, and the fact that I am interested in the "stop" words, do not help performance.

Thank you,
Adrian.
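
For illustration only: the quantity discussed in this message, the number of paragraphs that contain both terms, is the size of the intersection of the two terms' lists of matching document ids. The sketch below is plain Java and does not use the Lucene API. The class `CoOccurrence` and its methods are hypothetical names, and it assumes the matching document ids of each term have already been collected into ascending int arrays.

```java
import java.util.BitSet;

/** Hypothetical helper (not part of Lucene): counts documents shared by two terms. */
final class CoOccurrence {

    /** Size of the intersection of two ascending arrays of document ids. */
    static int countSorted(int[] a, int[] b) {
        int i = 0, j = 0, shared = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                 // id only in a, advance a
            } else if (a[i] > b[j]) {
                j++;                 // id only in b, advance b
            } else {
                shared++;            // id in both lists
                i++;
                j++;
            }
        }
        return shared;
    }

    /** Converts a list of document ids into a bit set with one bit per document. */
    static BitSet toBits(int[] docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    /** Same count as countSorted, computed as the cardinality of a bitwise AND. */
    static int countBits(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();  // keep the inputs unmodified
        both.and(b);
        return both.cardinality();
    }
}
```

Under these assumptions, the merge variant costs time proportional to the combined length of the two lists, which is why very frequent terms such as "and" dominate the running time. The bit set variant needs one bit per document, roughly 5 MB per term for 39 million documents, so keeping bit sets for all 10k terms in memory would take on the order of 50 GB.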