lucene-java-user mailing list archives

From Chris Hostetter <>
Subject Re: number of hits of pages containing two terms
Date Wed, 18 Mar 2009 00:54:34 GMT

: The final "production" computation is one-time, still, I have to recurrently
: come back and correct some errors, then retry...

this doesn't really seem like a problem ideally suited for Lucene ... it
seems more like the type of problem sequential batch crunching could solve:

first pass: tokenize each document into a bucket of words

second pass: count the occurrences of every word, and make a list of all 
words where the occurrence count is greater than N.

third pass: filter the word buckets from pass#1 so they only contain 
words in the list produced by pass#2

fourth pass: generate all pairs of words in every word bucket produced 
by pass#3

fifth pass: sort and count the unique pairs produced by pass#4
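
As a rough illustration, the five passes could be sketched in memory like this. It is only a sketch under some assumptions: the function name cooccurring_pairs, the regex tokenizer, and the choice to count each word once per document (document frequency rather than total occurrences) are all mine, and a real data set would need on-disk passes rather than Python lists.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurring_pairs(docs, min_count):
    """Count how many documents contain each pair of frequent words.

    Hypothetical helper: `docs` is an iterable of strings and `min_count`
    plays the role of N in the description above.
    """
    # Pass 1: tokenize each document into a bucket (set) of distinct words.
    # Assumption: a simple lowercase \w+ tokenizer is good enough.
    buckets = [set(re.findall(r"\w+", text.lower())) for text in docs]

    # Pass 2: count every word (once per document, since buckets are sets)
    # and keep the words whose count is greater than min_count.
    doc_freq = Counter(word for bucket in buckets for word in bucket)
    frequent = {word for word, n in doc_freq.items() if n > min_count}

    # Pass 3: filter each bucket down to the frequent words.
    filtered = [bucket & frequent for bucket in buckets]

    # Passes 4 and 5: generate all pairs per bucket and count them.
    # Sorting each bucket makes (a, b) and (b, a) the same pair.
    pair_counts = Counter()
    for bucket in filtered:
        pair_counts.update(combinations(sorted(bucket), 2))
    return pair_counts
```

Calling cooccurring_pairs(docs, N).most_common(10) would then list the ten pairs that share the most documents. Filtering in pass 3 before generating pairs matters because the number of pairs grows quadratically with the bucket size.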

...i have a hard time thinking in terms of Map/Reduce steps, but i'm 
guessing a Hadoop based app could do all this in a relatively 
straightforward manner.
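
For what it's worth, one possible way to phrase the passes as Map/Reduce jobs is sketched below. This is not Hadoop code: run_job is a hypothetical in-memory stand-in for the map, shuffle and reduce phases, and the split into three jobs, the whitespace tokenizer, and every function name here are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def run_job(records, mapper, reducer):
    """Simulate one Map/Reduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1 (passes 1-3): key by word, keep only words seen in > min_count docs.
def map_words(doc):
    doc_id, text = doc
    for word in set(text.lower().split()):
        yield word, doc_id


def make_frequent_reducer(min_count):
    def reduce_frequent(word, doc_ids):
        if len(doc_ids) > min_count:
            for doc_id in doc_ids:
                yield doc_id, word  # re-key by document for the next job
    return reduce_frequent


# Jobs 2 and 3 take (key, value) records as they are.
def map_identity(record):
    yield record


# Job 2 (pass 4): rebuild each filtered bucket and emit its word pairs.
def reduce_pairs(doc_id, words):
    for pair in combinations(sorted(words), 2):
        yield pair, 1


# Job 3 (pass 5): sum the ones emitted for each pair.
def reduce_sum(pair, ones):
    yield pair, sum(ones)


def pair_counts(docs, min_count):
    """`docs` is a list of (doc_id, text) tuples; returns {pair: doc count}."""
    kept = run_job(docs, map_words, make_frequent_reducer(min_count))
    pairs = run_job(kept, map_identity, reduce_pairs)
    return dict(run_job(pairs, map_identity, reduce_sum))
```

The grouping done inside run_job is what replaces the explicit sort in the fifth pass: the shuffle brings identical pairs together so the reducer only has to add them up.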

