lucene-java-user mailing list archives

From: Ian Lea <>
Subject: Re: number of hits of pages containing two terms
Date: Tue, 17 Mar 2009 11:59:46 GMT
OK - thanks for the explanation.  So this is not just a simple search ...

I'll go away and leave you and Michael and the other experts to talk
about clever solutions.


On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu
<> wrote:
> Ian Lea wrote:
>> Adrian - have you looked any further into why your original two term
>> query was too slow?  My experience is that simple queries are usually
>> extremely fast.
> Let me first point out that it is not "too slow" in absolute terms; it
> is only slow for my particular need of counting co-occurrences between,
> ideally, all pairs of non-noise terms (I plan about 10k x 10k = 100
> million such counts).
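> In case it helps to see it concretely, each count is just a conjunction
> query whose totalHits I read off; a minimal sketch (Lucene 2.4-era API,
> assuming an open IndexSearcher over the paragraph index):
>
>     import org.apache.lucene.index.Term;
>     import org.apache.lucene.search.BooleanClause;
>     import org.apache.lucene.search.BooleanQuery;
>     import org.apache.lucene.search.IndexSearcher;
>     import org.apache.lucene.search.TermQuery;
>     import org.apache.lucene.search.TopDocs;
>
>     // One of the ~100 million counts: paragraphs containing both terms.
>     int countCooccurrences(IndexSearcher searcher, String a, String b)
>             throws java.io.IOException {
>         BooleanQuery query = new BooleanQuery();
>         query.add(new TermQuery(new Term("text", a)), BooleanClause.Occur.MUST);
>         query.add(new TermQuery(new Term("text", b)), BooleanClause.Occur.MUST);
>         // Only one hit is fetched; totalHits still gives the full count.
>         TopDocs top = searcher.search(query, 1);
>         return top.totalHits;
>     }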
>> How large is the index?
> I indexed Wikipedia (the 8 GB XML dump you can download). The index size
> is 4.4 GB, and it holds 39 million documents. The particularity is that
> I cut Wikipedia into paragraphs and treat each paragraph as a Document
> (rather than one page per Document, as usual), which makes for a lot of
> short documents. Each document has a stored id and a non-stored,
> analyzed body:
>           doc.add(new Field("id", id, Store.YES, Index.NO));
>           doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>> How many occurrences of your first or second
>> terms?
> My index does include some words that are usually treated as "stop"
> words. My first two terms are "and" (13M hits) and "s" (4M hits). I use
> the SnowballAnalyzer to stem words.
> My intuition is that the large number of short documents, and the fact
> that I am interested in the "stop" words, do not help performance.
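> (For reference: I build the analyzer without a stop-word list, which is
> why "and" and "s" reach the index at all. Roughly, with the contrib
> lucene-snowball analyzer:
>
>     import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
>
>     // No stop-word list passed, so the usual "stop" words are indexed.
>     SnowballAnalyzer analyzer = new SnowballAnalyzer("English");
>
> )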
> Thank you,
> Adrian.
