lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Dimulescu <adrian.dimule...@gmail.com>
Subject Re: number of hits of pages containing two terms
Date Tue, 17 Mar 2009 16:58:23 GMT
Michael McCandless wrote:
> Is this a one-time computation?  If so, couldn't you wait a long time
> for the machine to simply finish it?
The final "production" computation is one-time, still, I have to 
recurrently come back and correct some errors, then retry...
>
> With the simple approach (doing 100 million 2-term AND queries), how
> long do you estimate it'd take?
About the estimated time, my existing index is really problematic, I 
should look for ways to optimize it, but I really think analyzer-time 
frequencies should do the job.
> I think you could do this with your own analyzer (as you
> suggested)... it would run normal tokenization, gather all unique
> terms that occurred, discard the "noise" terms (odd to me that you
> don't consider stop words as noise -- or maybe you mean noise (non
> salient terms) at the bigram level?)
By noise I mainly meant terms with frequency 1 (misspelled words, 
garbage escaping my Analyzer). In my current attempts I am really 
interested by common words.

Thanks for the advice,
Adrian.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message