Subject: Re: number of hits of pages containing two terms
From: Ian Lea <ian.lea@gmail.com>
To: java-user@lucene.apache.org
Date: Tue, 17 Mar 2009 11:59:46 +0000

OK - thanks for the explanation.  So this is not just a simple search ...

I'll go away and leave you and Michael and the other experts to talk
about clever solutions.

--
Ian.


On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu wrote:
> Ian Lea wrote:
>>
>> Adrian - have you looked any further into why your original two term
>> query was too slow?  My experience is that simple queries are usually
>> extremely fast.
>
> Let me first point out that it is not "too slow" in absolute terms, it is
> only too slow for my particular need: computing the number of
> co-occurrences between ideally all pairs of non-noise terms (I plan about
> 10k x 10k = 100 million calculations).
>
>> How large is the index?
>
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size
> is 4.4 GB. I have 39 million documents. The particularity is that I cut
> Wikipedia into paragraphs and I consider each paragraph as a Document (not
> one page per Document as usual). Which makes a lot of short documents.
> Each document has a stored id and a non-stored analyzed body:
>
>     doc.add(new Field("id", id, Store.YES, Index.NO));
>     doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>
>> How many occurrences of your first or second
>> terms?
>
> I do have in my index some words that are usually qualified as "stop"
> words. My first two terms are "and": 13M hits and "s": 4M hits. I use the
> SnowballAnalyzer in order to lemmatize words.
>
> My intuition is that the large number of short documents and the fact I am
> interested in the "stop" words do not help performance.
>
> Thank you,
> Adrian.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
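The co-occurrence count Adrian is after boils down to intersecting two sorted
docID posting lists, which is what a conjunctive two-term query does under the
hood. A minimal sketch, independent of the Lucene API (the class name and the
sample docID arrays are illustrative, not from the thread): the cost is linear
in the combined posting lengths, which is why very frequent terms like "and"
(13M postings) dominate the runtime regardless of how the query is phrased.

```java
// Counting documents that contain BOTH terms by intersecting two sorted
// docID posting lists with a two-pointer merge. Lucene stores postings in
// ascending docID order, so this mirrors what a conjunctive query costs:
// O(df1 + df2) comparisons per term pair.
public class CoOccurrenceCount {

    // a and b are ascending, duplicate-free docID lists for two terms
    static int intersectCount(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {        // same document holds both terms
                count++;
                i++;
                j++;
            } else if (a[i] < b[j]) {  // advance the list that is behind
                i++;
            } else {
                j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] and = {1, 2, 5, 7, 9};   // hypothetical docs containing "and"
        int[] s   = {2, 3, 7, 8};      // hypothetical docs containing "s"
        // Docs 2 and 7 contain both terms
        System.out.println(intersectCount(and, s)); // prints 2
    }
}
```

For 10k x 10k term pairs, one such merge per pair is what makes the full
computation expensive; skipping or caching the longest (stop-word) posting
lists is where most of the time would be saved.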