mahout-user mailing list archives

From prasenjit mukherjee <>
Subject Re: newbie question: LSA analysis + others
Date Wed, 17 Jun 2009 06:07:21 GMT
Well, there is a PLSI implementation using Pig (over Hadoop) as a Mahout
patch:


On Wed, Jun 17, 2009 at 7:34 AM, Paul Jones <> wrote:

> Hi to one and all
> First time on this list. I have read through the wiki, FAQ and other docs,
> but before I dive further into Mahout I have a few questions, or should I say
> clarifications.
> I am looking for a system which would allow me to:
> 1. Take a set of words
> 2. Build clusters of these words, i.e. work out the semantic relationships
> between these words (I guess I could use WordNet as a starter).
> 3. Once clusters of words have been formed, also work out the relationships
> between the clusters themselves.
> So in essence I could work out that red is similar to crimson, and hence
> a search on red would return docs with crimson in them even though red is
> not mentioned.
> Would Mahout work here?
> Of course, prior to this there is the problem of cleaning up the data, i.e.
> stemming etc.
> Now I have read several detailed papers on clustering, ranking, etc, and of
> course some algos are better than others, but to me a platform like Mahout
> seems interesting since you can deploy the existing ones in the system, and
> also later on add others.
> Looking at the algorithms, it seems as if LSI (PLSI) has not been
> implemented yet. If so, which other algo would "suffice" in this case?
> Admittedly my knowledge of algos is poor, to say the least :-). Also, where
> would Lucene fit in (if it does)? Would it be used to search the results
> after the algos have been applied, since it seems as if Lucene just uses a
> weighting system to create the index? Or can Mahout do it all?
> As you can see, I am confused, but this is my first pass at this system.
> tks
> Paul
> P.S. Are any of the algos feedback algos, i.e. could someone
> improve results using user feedback?
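
To make the red/crimson idea above concrete, here is a rough sketch in plain
Python. It is not LSA or PLSI and it does not use any Mahout or Lucene API:
it only compares words by the other words they share documents with (simple
co-occurrence vectors plus cosine similarity), then expands a one-word query
with its close neighbours. The toy documents, the function names and the 0.9
threshold are all made up for illustration.

```python
import math
from collections import defaultdict

# Toy corpus, invented for this example. "red" and "crimson" never share a document.
DOCS = [
    "red rose petals",
    "crimson rose petals",
    "blue sky clouds",
    "azure sky clouds",
]

def context_vectors(docs):
    """For each word, count the other words that appear in the same document."""
    ctx = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        tokens = set(doc.lower().split())
        for a in tokens:
            for b in tokens:
                if a != b:
                    ctx[a][b] += 1
    return ctx

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

def expand(term, ctx, threshold=0.9):
    """Return the term plus every word whose context vector is close to it."""
    if term not in ctx:
        return {term}
    return {term} | {
        other
        for other, vec in ctx.items()
        if other != term and cosine(ctx[term], vec) >= threshold
    }

def search(term, docs, ctx):
    """Indices of documents containing the term or any of its expansions."""
    terms = expand(term.lower(), ctx)
    return [i for i, doc in enumerate(docs) if terms & set(doc.lower().split())]

ctx = context_vectors(DOCS)
print(sorted(expand("red", ctx)))  # ['crimson', 'red']
print(search("red", DOCS, ctx))    # [0, 1]
```

On real data the raw counts would normally be weighted and reduced (which is
where LSA/PLSI come in), and stemming would happen before the counting step.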
