Message-ID: <49BF8AF7.7000809@gmail.com>
Date: Tue, 17 Mar 2009 12:35:19 +0100
From: Adrian Dimulescu <adrian.dimulescu@gmail.com>
User-Agent: Thunderbird 2.0.0.19 (X11/20090105)
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: number of hits of pages containing two terms
References: <1237198328.57959.ezmlm@lucene.apache.org>
	 <49BE292D.6050809@gmail.com>
	 <ADDF6BB8-358B-46C5-A310-53DEDDD0C933@mikemccandless.com>
	 <4e7841490903170242h1cb23b1bta01ae4c662e99a4e@mail.gmail.com>
	 <66B3083E-06E6-4CA0-B491-3198BB315425@mikemccandless.com>
 <8c4e68610903170410t6157cd12u15c5bfb98cdea33@mail.gmail.com>
In-Reply-To: <8c4e68610903170410t6157cd12u15c5bfb98cdea33@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Ian Lea wrote:
> Adrian - have you looked any further into why your original two term
> query was too slow?  My experience is that simple queries are usually
> extremely fast.  
Let me first point out that it is not "too slow" in absolute terms; it is
only too slow for my particular need, which is to compute the number of
co-occurrences between, ideally, all pairs of non-noise terms (I plan about
10k x 10k = 100 million calculations).
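As a rough check on that figure, here is a small sketch of the pair arithmetic. It assumes co-occurrence counts are symmetric (the count for "a" with "b" equals the count for "b" with "a"), in which case only the unordered pairs need a query. The class and method names are made up for illustration.

```java
final class PairCount {
    /** Ordered pairs, including a term paired with itself: n * n. */
    static long orderedPairs(long n) {
        return n * n;
    }

    /** Unordered pairs of two distinct terms: n * (n - 1) / 2. */
    static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 10_000; // number of non-noise terms planned above
        System.out.println("ordered pairs:   " + orderedPairs(n));
        System.out.println("unordered pairs: " + unorderedPairs(n));
    }
}
```

Under that symmetry assumption the workload is roughly half of the 100 million figure, which still leaves tens of millions of two-term counts.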
> How large is the index?
I indexed Wikipedia (the 8 GB XML dump you can download). The index size
is 4.4 GB and it contains 39 million documents. The particularity is that I cut
Wikipedia into paragraphs and treat each paragraph as a Document (not
one page per Document, as is usual), which makes for a lot of short documents.
Each document has a stored id and a non-stored, analyzed body:

            doc.add(new Field("id", id, Store.YES, Index.NO));
            doc.add(new Field("text", p, Store.NO, Index.ANALYZED));

> How many occurrences of your first or second
> terms?  
My index does contain some words that are usually classified as "stop"
words. My first two terms are "and" (13M hits) and "s" (4M hits). I use
the SnowballAnalyzer to stem words.

My intuition is that the large number of short documents and the fact that I
am interested in the "stop" words do not help performance.
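To make that intuition concrete, here is a simplified model of what counting the documents that contain both terms involves. It is not Lucene's actual implementation: it assumes each term's postings are available as a sorted array of document ids, and the class and method names are hypothetical. The point is only that the work grows with the length of the two postings lists, so a pair like "and" (13M) and "s" (4M) is close to the worst case.

```java
final class PostingIntersection {
    /**
     * Counts the document ids present in both arrays.
     * Both arrays must be sorted ascending with no duplicates.
     * Runs in O(a.length + b.length) time.
     */
    static int countCommon(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;            // doc a[i] lacks the second term
            } else if (a[i] > b[j]) {
                j++;            // doc b[j] lacks the first term
            } else {
                count++;        // same doc id: both terms occur here
                i++;
                j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] docsWithFirstTerm = {1, 3, 5, 8, 13};
        int[] docsWithSecondTerm = {2, 3, 8, 21};
        // Docs 3 and 8 contain both terms.
        System.out.println(countCommon(docsWithFirstTerm, docsWithSecondTerm));
    }
}
```

A real index can skip ahead in the longer list when the other list is short, which helps rare terms but does little when both lists are long.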

Thank you,
Adrian.
