lucene-java-user mailing list archives

From Adrian Dimulescu <adrian.dimule...@gmail.com>
Subject Re: number of hits of pages containing two terms
Date Tue, 17 Mar 2009 11:35:19 GMT
Ian Lea wrote:
> Adrian - have you looked any further into why your original two term
> query was too slow?  My experience is that simple queries are usually
> extremely fast.  
Let me first point out that it is not "too slow" in absolute terms; it 
is only slow for my particular purpose, which is computing the number of 
co-occurrences between, ideally, all pairs of non-noise terms (I plan 
about 10k x 10k = 100 million such counts).
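
For each pair, the query itself is just a two-term conjunction whose hit 
count I read off. A minimal sketch against the Lucene 2.4-era API 
(assuming an already-open IndexSearcher named "searcher"):

            import org.apache.lucene.index.Term;
            import org.apache.lucene.search.BooleanClause.Occur;
            import org.apache.lucene.search.BooleanQuery;
            import org.apache.lucene.search.TermQuery;
            import org.apache.lucene.search.TopDocs;

            // paragraphs containing both terms; only the count is
            // needed, so ask for a single result
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("text", "and")), Occur.MUST);
            query.add(new TermQuery(new Term("text", "s")), Occur.MUST);

            TopDocs top = searcher.search(query, null, 1);
            int cooccurrences = top.totalHits;
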
> How large is the index?
I indexed Wikipedia (the 8 GB XML dump you can download). The index size 
is 4.4 GB and it holds 39 million documents. The particularity is that I 
cut Wikipedia into paragraphs and treat each paragraph as a Document 
(not one page per Document, as usual), which makes for a lot of short 
documents. Each document has a stored id and a non-stored, analyzed body:

            doc.add(new Field("id", id, Store.YES, Index.NO));
            doc.add(new Field("text", p, Store.NO, Index.ANALYZED));

> How many occurrences of your first or second
> terms?  
I do have in my index some words that are usually treated as "stop" 
words. My first two terms are "and" (13M hits) and "s" (4M hits). I use 
the SnowballAnalyzer in order to stem words.
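
For reference, the writer is set up roughly like this (a sketch; 
"indexDir" stands for my index Directory, and the analyzer is built 
without a stop-word list so that words like "and" stay in the index):

            import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
            import org.apache.lucene.index.IndexWriter;

            // English Snowball stemmer, no stop-word list
            SnowballAnalyzer analyzer = new SnowballAnalyzer("English");
            IndexWriter writer = new IndexWriter(indexDir, analyzer,
                    true, IndexWriter.MaxFieldLength.UNLIMITED);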

My intuition is that the large number of short documents, together with 
the fact that I am interested precisely in the "stop" words, does not 
help performance.

Thank you,
Adrian.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

