mahout-user mailing list archives

From Paul Jones <>
Subject Re: mahout PLSI (with some lucene, thrown in)
Date Tue, 23 Jun 2009 13:45:40 GMT
Thanks Grant, more questions... I think it's better if I explain what I am trying to do.

1. I want to crawl blogs which talk about "cars". To me, Nutch would do this.
2. Within that crawl, I then want to be able to search for various words, e.g. "red toyota".

In order to do point 2, I would need to index all the data and provide a "rank/rating" for
each result.
Nutch does this using a scoring mechanism similar to Lucene's, and (based on what you mentioned)
I read that Nutch can boost the URL, anchor, title, etc.
Nutch also allows search, BUT is Lucene better for a large-scale system, since it seems
to allow "better" searching, or at least better access to it? If so, I would need to give Lucene
access to the index created by Nutch. (I guess one of my questions is what happens during indexing:
is it the scoring/rating, or just "indexing" to allow faster data retrieval?) Is this correct?

3. There is an inter-relationship between "words" in the documents, and a relationship between
the "word" and the webpage itself, so tf-idf works out the "relationship" between the keywords
and the documents, i.e. "red" is more relevant to doc1 than to doc2.

This Lucene can do, and it gives a basic rating system based on the searched keyword and the
documents returned... Hopefully so far so good :-)
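
A minimal sketch of the tf-idf idea in point 3, assuming a made-up three-document corpus and the textbook formula tf * log(N / df). The `docs` data and the `tf_idf` helper are hypothetical names for illustration; Lucene's actual scoring adds length norms and boosts on top of this.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = {
    "doc1": "red toyota red paint red seats",
    "doc2": "red toyota blue seats",
    "doc3": "blue honda",
}

def tf_idf(term, doc_id, docs):
    """Textbook tf-idf: relative term frequency times log(N / document frequency)."""
    words = docs[doc_id].split()
    tf = Counter(words)[term] / len(words)
    # Number of documents that contain the term at least once.
    df = sum(1 for text in docs.values() if term in text.split())
    if df == 0:
        return 0.0  # term never seen anywhere in the corpus
    return tf * math.log(len(docs) / df)

for doc_id in docs:
    print(doc_id, round(tf_idf("red", doc_id, docs), 3))
```

Here "red" scores higher for doc1 than for doc2 because it makes up a larger share of doc1's words, and it scores zero for doc3, which never mentions it.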

4. But what if I wanted to understand the relationships between the keywords themselves? Assume
I had the word "red" and wanted to display words similar to it, like "crimson", i.e. if I
have collected 100K keywords and wanted to build clusters of these keywords, so that "red,
crimson, ruby, magenta" formed cluster 1 and "blue, azure, ultramarine" formed cluster 2.
Then when someone searched for "ruby", although the tf-idf calc would show "No results", I could
look up my cluster, see what other colours are similar, and fire a query for "red OR
crimson OR magenta"; it would then return results based on the cluster in which that colour
was present.
Use case: a user searches for "red cars" but my crawl has picked up crimson cars only. Now
unless I know crimson and red are "related", I may have zero results.
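
The cluster lookup in point 4 could be sketched roughly as below, assuming the clusters already exist (built by hand or by some clustering job). The `clusters` list and the `expand_query` helper are hypothetical; the output is just an OR-style query string, not a call into any real search API.

```python
# Hand-written clusters matching the colour example above (illustrative only).
clusters = [
    {"red", "crimson", "ruby", "magenta"},
    {"blue", "azure", "ultramarine"},
]

def expand_query(query, clusters):
    """Replace each query term that belongs to a cluster by an OR group of the whole cluster."""
    parts = []
    for term in query.lower().split():
        # Find the first cluster containing the term; fall back to the term alone.
        related = next((c for c in clusters if term in c), {term})
        if len(related) > 1:
            parts.append("(" + " OR ".join(sorted(related)) + ")")
        else:
            parts.append(term)
    return " ".join(parts)

print(expand_query("red cars", clusters))
```

With this, a search for "red cars" also matches pages that only mention crimson cars, at the cost of some precision.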

I guess in the case of colours a manual cluster may need to be formed, but surely there must
be a way of clustering these words dynamically. Imagine we have crawled 100K webpages, and
we have 100 pages which show "red", 100 which show "crimson" and then 100 which show both
"red" and "crimson". Is there a way to deduce that there may be an (albeit weak) relationship
between red AND crimson? Of course we can pre-seed this info, which then gets weighted by actual results.
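
One way to put a number on that relationship is a co-occurrence measure over page counts. The sketch below assumes the figures above mean 100 pages with "red" only, 100 with "crimson" only and 100 with both, out of 100K pages; the `jaccard` and `pmi` helpers are illustrative names, and these are two common measures among many, not necessarily what any Mahout algorithm uses.

```python
import math

def jaccard(n_a, n_b, n_both):
    """Share of pages containing either word that contain both."""
    return n_both / (n_a + n_b - n_both)

def pmi(n_a, n_b, n_both, n_docs):
    """Pointwise mutual information: log of observed vs. expected co-occurrence."""
    p_a = n_a / n_docs
    p_b = n_b / n_docs
    p_both = n_both / n_docs
    return math.log(p_both / (p_a * p_b))

n_docs = 100_000
n_red = 200      # 100 red-only pages + 100 pages with both (assumed reading)
n_crimson = 200  # 100 crimson-only pages + 100 pages with both
n_both = 100

print(round(jaccard(n_red, n_crimson, n_both), 3))
print(round(pmi(n_red, n_crimson, n_both, n_docs), 3))
```

A PMI well above zero says the two words appear together far more often than chance would predict, which is the kind of signal that could seed or re-weight a cluster.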

5. And this is where Mahout comes in... or at least I think it does. Mahout has lots of clever
algos under the hood, some more relevant than others. Where I am really getting confused
is at what point in my pipeline to deploy these.

Nutch ---> Mahout ---> Lucene ---> Taste ---> Mahout
[crawl + index ---> algos for clustering, distance, rating ---> search ---> user feedback ---> algos ...]

If I wanted to implement PLSI, the above scenario would work for me, BUT how would the scoring
done by Nutch affect the data fed into Mahout for this? Should the data just be raw (parsed
etc., but no rating), then processed, then opened for search, and then user feedback dropped

Hope that's a little clearer. Wondering what setups people have, i.e. the block-level order
in which the data is processed. Maybe I am reading it wrong and it's not a one-to-one process.

Thanks for reading


From: Grant Ingersoll <>
Sent: Tuesday, 23 June, 2009 11:39:31
Subject: Re: mahout PLSI (with some lucene, thrown in)

On Jun 21, 2009, at 9:39 PM, Paul Jones wrote:

> I think I am starting to get a feel for what each of these frameworks can achieve; however,
> due to overlap in some of these applications, I am curious about how each one exposes data
> to the other. Again, I trawled through the lists as best I can, and read the Lucene in Action
> book over the weekend.
> To me Nutch should be used as a crawler rather than an indexer (but I have read that
> Nutch is better at indexing than Lucene, and hence Lucene should be used just for search).

Nutch uses Lucene for indexing. The two aren't really comparable. Lucene is a search library.
Nutch is an application designed for large scale crawling and search. Nutch tends to be
pretty monolithic. You might be happier with Solr, as it is more flexible and easier to configure,
but still gives you access to Lucene.

> Mahout seems to come into its element when you are playing with various algorithms, whether
> for clustering, nearest neighbour or whatever, but Lucene also seems to work with term vectors
> (as does Nutch) to work out the "distance" between words. If so, once this is done, are the
> words then already ranked? If so, would you then run other algos like PLSI on that data, or
> (at least to me) does it make more sense to take the data from Nutch, use Mahout, and then push
> it back into Lucene to search with?

One aspect of Mahout (or machine learning) that I find intriguing is using it to power "intelligent"
search.  In this case, you use ML to extract/categorize/cluster, etc. all in an effort to
make it easier for people to search/discover the information they are looking for.

There are, of course, many other uses that have nothing to do with search, and there is nothing
search-specific about Mahout other than the LuceneIterable class in utils and a few helper classes
that make working with text easier. It is perfectly reasonable to use Mahout on numerical data or
even mixed data as long as you can properly set up the problem.

> Another question on indexing:
> The vector calcs or term frequencies build the relationship between words in a document/web
> page, e.g. "red" is related to "crimson", but how does this relate back to ranking the documents
> themselves in a search query? So you search for "red"; now it is related to "crimson", but if
> doc1 has "red" in it, it should be returned at pos1, and the one with "crimson" at pos2. I
> am going to try to answer my own question; let me know if the answer sucks...
> Each word is related to a document: word -> doc
> So as relationships between words are formed, relationships between the
> docs are inherently deduced from them too. Is this kind of correct? So you need not worry about
> ranking the document itself? Or are there two indexes, one which contains the relationships between
> the words with a doc, and the other which relates each word to each doc? If this is true, can
> you run different algos on each problem to get the end results?
> e.g. red relates to crimson with value 1, and red relates to blue with value 0.5, so we
> have relationships between words
> Now red relates to doc1 as +1 and to doc2 as +0.5, and crimson relates to doc2
> as +1, hence we have relationships between words and the docs
> phew....

I'm not sure I'm following. In traditional TF-IDF search (i.e. Lucene), red and crimson would
both relate to one or more docs. Whether red or crimson comes first is going to depend on
the statistics of the collection. Presumably, if you have some a priori information about
those words (maybe based on your analysis of the documents), you could boost one word even
more such that red comes first.
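
To make the effect of such a boost concrete, here is a toy sketch that ranks two made-up documents by the sum of boost * raw term count. The raw count stands in for a real TF-IDF score, and `rank`, `docs` and the boost values are all hypothetical; this is not how Lucene computes scores internally.

```python
def rank(docs, term_boosts):
    """Order doc ids by sum(boost * term count), highest score first."""
    scores = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        scores[doc_id] = sum(boost * words.count(term)
                             for term, boost in term_boosts.items())
    return sorted(scores, key=scores.get, reverse=True)

# Invented documents: doc2 mentions "crimson" twice, doc1 mentions "red" once.
docs = {"doc1": "red toyota", "doc2": "crimson crimson toyota"}

print(rank(docs, {"red": 1.0, "crimson": 1.0}))  # equal boosts: collection statistics decide
print(rank(docs, {"red": 3.0, "crimson": 1.0}))  # boosting "red" moves doc1 to the top
```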

> So two more questions :-). I looked at integrating user feedback. If we assume we have
> obtained the feedback, and a person thinks doc1 is actually about "crimson", how would this
> be integrated back into the algos? Would this be via the boost function in Lucene, or is
> there a better way of doing it using Taste and dropping it into the Mahout analysis?

I'd say you could likely do it with either.

> and ... how do you rate the "words" from a Title, Meta Tag, Image Alt text higher than
> other words in the webpage, or even, say, user-defined Tags in blogs?

In Lucene, you can do this several ways.  If you want to boost the whole field, then do just
that.  If you want to boost individual terms in a given field, you need to use Payloads and
the BoostingTermQuery.
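
The whole-field variant can be pictured with a small language-neutral sketch: each field carries a weight, and a match in a heavier field counts for more. The field names, weights and `boosted_score` helper below are assumptions for illustration only and do not use the Lucene API.

```python
# Hypothetical per-field weights: a title hit counts three times a body hit.
FIELD_BOOSTS = {"title": 3.0, "tags": 2.0, "body": 1.0}

def boosted_score(term, doc, boosts=FIELD_BOOSTS):
    """Sum term occurrences per field, each multiplied by that field's boost."""
    score = 0.0
    for field, text in doc.items():
        count = text.lower().split().count(term)
        score += boosts.get(field, 1.0) * count
    return score

# Invented pages: one title hit versus two body hits.
page_a = {"title": "crimson cars", "body": "a blog about cars"}
page_b = {"title": "my weekend", "body": "saw a crimson car and another crimson car"}

print(boosted_score("crimson", page_a))
print(boosted_score("crimson", page_b))
```

With these weights, the single title match on page_a outranks the two body matches on page_b.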

Grant Ingersoll
