lucene-java-user mailing list archives

From Michael McCandless
Subject Re: number of hits of pages containing two terms
Date Tue, 17 Mar 2009 12:20:47 GMT
Is this a one-time computation?  If so, couldn't you wait a long time
for the machine to simply finish it?

With the simple approach (doing 100 million 2-term AND queries), how
long do you estimate it'd take?
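
As a rough way to frame that estimate, here is a back-of-envelope sketch in Java. The per-query latencies are assumed placeholders, not measurements from this index, and the class and method names are made up for illustration. It also counts unordered pairs, since the number of documents containing both a and b is the same as for b and a.

```java
// Back-of-envelope estimate; the per-query latencies below are assumed, not measured.
final class PairQueryEstimate {
    /** Number of unordered pairs of distinct terms: n * (n - 1) / 2. */
    static long unorderedPairs(long terms) {
        return terms * (terms - 1) / 2;
    }

    /** Wall-clock hours for `queries` sequential queries at `millisPerQuery` each. */
    static double hours(long queries, double millisPerQuery) {
        return queries * millisPerQuery / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        long terms = 10_000;                      // term count mentioned in the thread
        long ordered = terms * terms;             // 100 million, as planned
        long unordered = unorderedPairs(terms);   // (a, b) and (b, a) give the same count
        for (double ms : new double[] {1, 10, 100}) {   // hypothetical latencies
            System.out.printf("%.0f ms/query: %.1f h for ordered pairs, %.1f h for unordered pairs%n",
                    ms, hours(ordered, ms), hours(unordered, ms));
        }
    }
}
```

At an assumed 1 ms per query, 100 million sequential queries come to roughly 28 hours; skipping the symmetric duplicates about halves that.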

I think you could do this with your own analyzer (as you
suggested). It would run normal tokenization, gather all unique
terms that occurred, and discard the "noise" terms. (It seems odd to
me that you don't consider stop words as noise -- or maybe you mean
noise, i.e. non-salient terms, at the bigram level?) Custom code
would then emit "all pairs" of the unique tokens it encountered,
which Lucene then indexes.
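
The pair-emitting step could be sketched like this in plain Java. It is only an illustration of the idea, not a Lucene analyzer: the tokenizer is a simple stand-in (no stemming), the noise list is hypothetical, and the "_" separator and the names `TermPairs` / `pairTokens` are arbitrary choices. Wiring the output into a real token stream is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class TermPairs {
    /**
     * Returns one token per unordered pair of distinct, non-noise terms in the text.
     * The two terms are kept in sorted order so (a, b) and (b, a) yield the same token.
     */
    static List<String> pairTokens(String paragraph, Set<String> noise) {
        // Stand-in tokenizer: lowercase, then split on anything that is not a letter or digit.
        TreeSet<String> unique = new TreeSet<>();
        for (String t : paragraph.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !noise.contains(t)) {
                unique.add(t);
            }
        }
        List<String> terms = new ArrayList<>(unique);   // iteration order of a TreeSet is sorted
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                // "_" cannot occur inside a term produced by the split above.
                pairs.add(terms.get(i) + "_" + terms.get(j));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Set<String> noise = Set.of("the");   // hypothetical noise list
        System.out.println(pairTokens("The cat and the dog and the cat", noise));
    }
}
```

If tokens like these were indexed in their own field, the number of documents containing a given pair token would be the co-occurrence count for that pair, so each two-term AND query would become a single-term lookup.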

The problem is that this is an O(N^2) sort of thing (N being the
number of unique terms in a paragraph), but maybe co-occurrence
constraints of natural language, plus the fact that you're dealing
w/ paragraphs, make this tenable / faster than the simple approach
above.
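
To get a feel for how the quadratic growth plays out at indexing time, here is a small sketch. The 39 million document count is the one reported in the thread; the unique-terms-per-paragraph values are assumptions picked for illustration, and real paragraphs will vary.

```java
// Rough size of the pair index; unique-term counts per paragraph are assumed values.
final class PairPostingsEstimate {
    /** Pair tokens emitted for one document with n unique (non-noise) terms. */
    static long pairsPerDoc(int uniqueTerms) {
        return (long) uniqueTerms * (uniqueTerms - 1) / 2;
    }

    public static void main(String[] args) {
        long docs = 39_000_000L;   // paragraph count reported in the thread
        for (int n : new int[] {10, 30, 100}) {   // hypothetical unique terms per paragraph
            long perDoc = pairsPerDoc(n);
            System.out.printf("%d unique terms: %d pairs/doc, %d pair postings overall%n",
                    n, perDoc, perDoc * docs);
        }
    }
}
```

Going from 10 to 30 unique terms per paragraph raises the pairs per document from 45 to 435, which is why short paragraphs and aggressive noise filtering matter so much here.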


Adrian Dimulescu wrote:

> Ian Lea wrote:
>> Adrian - have you looked any further into why your original two term
>> query was too slow?  My experience is that simple queries are usually
>> extremely fast.
> Let me first point out that it is not "too slow" in absolute terms;
> it is only too slow for my particular need, which is to compute the
> number of co-occurrences between, ideally, all pairs of non-noise
> terms (I plan on about 10k x 10k = 100 million calculations).
>> How large is the index?
> I indexed Wikipedia (the 8 GB XML dump you can download). The index
> size is 4.4 GB. I have 39 million documents. The particularity is
> that I cut Wikipedia into paragraphs and treat each paragraph as a
> Document (not one page per Document, as usual), which makes for a
> lot of short documents. Each document has a stored id and a
> non-stored, analyzed body:
>           doc.add(new Field("id", id, Store.YES, Index.NO));
>           doc.add(new Field("text", p, Store.NO, Index.ANALYZED));
>> How many occurrences of your first or second
>> terms?
> I do have in my index some words that are usually classified as
> "stop" words. My first two terms are "and" (13M hits) and "s" (4M
> hits). I use the SnowballAnalyzer in order to stem words.
> My intuition is that the large number of short documents and the
> fact that I am interested in the "stop" words do not help performance.
> Thank you,
> Adrian.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
