mahout-user mailing list archives

From Paul Jones <>
Subject mahout PLSI (with some lucene, thrown in)
Date Mon, 22 Jun 2009 01:39:29 GMT
I think I am starting to get a feel for what each of these frameworks can achieve. However,
because some of these applications overlap, I am curious about how each one exposes its data
to the others. I have trawled through the lists as best I can, and read the Lucene in Action book
over the weekend.

To me, Nutch should be used as a crawler rather than an indexer (though I have read that Nutch
is better at indexing than Lucene, and hence Lucene should be used just for search). Mahout
seems to come into its element when you are playing with various algorithms, whether for clustering,
nearest neighbour or whatever. But Lucene also seems to work with term vectors (as does Nutch)
to work out the "distance" between words. If so, once this is done, are the words then already
ranked? And would you then run other algorithms like PLSI on that data? Or (and this makes more
sense, at least to me) would you take the data from Nutch, run it through Mahout, and then push
the results back into Lucene to search with?

Another question on indexing:

The vector calculations or term frequencies build the relationships between words in a document
or web page, e.g. "red" is related to "crimson". But how does this relate back to ranking the
documents themselves for a search query? Say you search for "red", which is now related to
"crimson": if doc1 has "red" in it, it should be returned at pos1, and the doc with "crimson"
at pos2. I am going to try to answer my own question; let me know if the answer sucks...

Each word is related to a document: word -> doc
So as relationships between words are formed, relationships between the docs are inherently
deduced from them as well. Is this kind of correct? So you need not worry about ranking the
documents themselves? Or are there two indexes, one containing the relationships between the
words within a doc, and the other relating each word to each doc? If this is true, can you
run different algorithms on each problem to get the end results?

e.g. "red" relates to "crimson" with value 1, and "red" relates to "blue" with value 0.5, so we
have relationships between words.
Now "red" relates to doc1 as +1 and to doc2 as +0.5, and "crimson" relates to doc2 as
+1, hence we have relationships between the words and the docs.
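
To make the example above concrete, here is a minimal Python sketch of how the two sets of relationships might combine at query time. It is plain Python, not anything Lucene or Mahout provides; the function name `rank`, the `expansion_discount` parameter and its default of 0.4 are assumptions made up for illustration.

```python
# Word-to-word relatedness, using the values from the example above.
word_sim = {("red", "crimson"): 1.0, ("red", "blue"): 0.5}

# Word-to-document weights, using the values from the example above.
word_doc = {
    "red": {"doc1": 1.0, "doc2": 0.5},
    "crimson": {"doc2": 1.0},
}

def rank(query, expansion_discount=0.4):
    """Score docs by the query word plus discounted related words.

    Hypothetical scheme: the discount value is an assumption, not a tuned number.
    """
    scores = {}
    # The query word itself counts at full weight.
    for doc, weight in word_doc.get(query, {}).items():
        scores[doc] = scores.get(doc, 0.0) + weight
    # Related words contribute, scaled by relatedness and by the discount.
    for (word, related), sim in word_sim.items():
        if word != query:
            continue
        for doc, weight in word_doc.get(related, {}).items():
            scores[doc] = scores.get(doc, 0.0) + expansion_discount * sim * weight
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

discounted = rank("red")
undiscounted = rank("red", expansion_discount=1.0)
print(discounted)
print(undiscounted)
```

With the discount in place doc1 comes out ahead of doc2 for "red". With the discount set to 1.0, doc2 overtakes doc1 because "crimson" counts as much as "red" itself, which is the ordering problem described earlier.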


So two more questions :-). I looked at integrating user feedback. If we assume we have obtained
the feedback, and a person thinks doc1 is actually about "crimson", how would this be integrated
back into the algorithms? Would this be via the boost function in Lucene, or is there a better
way of doing it using Taste and dropping it into the Mahout analysis?
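
As a rough picture of the simplest form this could take, the sketch below nudges a word-to-doc weight upward each time a user says a doc is about a word. It is plain Python and does not use Lucene's boost or Taste; the function `apply_feedback`, the `rate` parameter and its value of 0.25 are hypothetical, and the starting weights are illustrative only.

```python
from collections import defaultdict

# Word-to-document weights; values are illustrative only.
word_doc = defaultdict(dict, {
    "red": {"doc1": 1.0, "doc2": 0.5},
    "crimson": {"doc2": 1.0},
})

def apply_feedback(weights, word, doc, votes, rate=0.25):
    """Raise the weight of (word, doc) by `rate` per user vote (hypothetical rule)."""
    current = weights[word].get(doc, 0.0)
    weights[word][doc] = current + rate * votes
    return weights[word][doc]

# Two users say doc1 is really about "crimson".
new_weight = apply_feedback(word_doc, "crimson", "doc1", votes=2)
print(new_weight)
```

Whether such an adjustment belongs in the index weights, in a query-time boost, or in a separate recommender model is exactly the open question here.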

And... how do you rate the words from a title, meta tag or image alt text higher than other
words in the web page, or even, say, user-defined tags in blogs?
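
One way to picture this is a per-field multiplier applied to term counts, sketched below in plain Python. This is not Lucene's API; the field names, the `FIELD_BOOST` values and the function `field_weighted_tf` are assumptions for illustration, and the whitespace tokenising is deliberately naive.

```python
# Hypothetical per-field boosts; the numbers are guesses, not tuned values.
FIELD_BOOST = {"title": 3.0, "meta": 2.0, "alt": 1.5, "tags": 2.5, "body": 1.0}

def field_weighted_tf(doc_fields, term):
    """Sum the term's count in each field, multiplied by that field's boost."""
    term = term.lower()
    total = 0.0
    for field, text in doc_fields.items():
        # Naive tokenising: lowercase and split on whitespace.
        count = text.lower().split().count(term)
        total += FIELD_BOOST.get(field, 1.0) * count
    return total

page = {
    "title": "Red paint guide",
    "alt": "a red wall",
    "body": "Crimson and red are both popular. Red sells best.",
}
score = field_weighted_tf(page, "red")
print(score)
```

Here a single hit in the title counts for more than both hits in the body combined, which is the effect being asked about.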

Sorry for the long post, I have just got lots of questions going round, and want to design what I
need on paper first before I delve in with rm -rf ..tks

