mahout-user mailing list archives

From Joyce Babu <jo...@joycebabu.com>
Subject Re: Mahout for Keyword Extraction
Date Thu, 03 Feb 2011 14:54:14 GMT
Thanks for the details Vineet.

I have already tried KEA with a training set of 300 stories and keywords generated using OpenCalais,
but the output was of very low quality (I did not use any vocabulary or stop words). When
I tried with the linked open data from data.nytimes.com, the output quality was good. I think
it has potential with a good vocabulary. However, KEA doesn't return a relevance value.
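
To make the vocabulary idea concrete, here is a minimal sketch in plain Python of matching a story against a fixed list of approved keywords and attaching a simple relevance score. It is not KEA's method or any Mahout API. The function name `rank_keywords`, the score (occurrences per 100 tokens), and the sample story and keyword list are all assumptions made up for illustration.

```python
import re
from collections import Counter


def rank_keywords(text, vocabulary):
    """Return (keyword, score) pairs for vocabulary entries found in text.

    The score is occurrences per 100 tokens, an illustrative stand-in
    for a relevance value. Results are sorted by score, highest first,
    with ties broken alphabetically.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return []
    counts = Counter()
    for phrase in vocabulary:
        words = re.findall(r"[a-z0-9]+", phrase.lower())
        n = len(words)
        if n == 0:
            continue
        # Count every position where the phrase's words appear consecutively.
        hits = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words
        )
        if hits:
            counts[phrase] = hits
    scored = [(kw, 100.0 * c / len(tokens)) for kw, c in counts.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample input, for illustration only.
story = (
    "Protesters filled Tahrir Square in Cairo on Thursday. "
    "Hosni Mubarak addressed Egypt as crowds in Cairo grew."
)
approved = ["Cairo", "Hosni Mubarak", "Tahrir Square", "Barack Obama"]
ranked = rank_keywords(story, approved)
for keyword, score in ranked:
    print(f"{keyword}: {score:.2f}")
```

A raw frequency score like this favours terms that are simply repeated often; weighting by how rare a term is across all stories would probably give a better ranking, at the cost of needing the whole collection.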

I will go through the provided links on the different algorithms. It will take me some time
to digest it completely :)

Can I use clustering to detect similar documents? For example, over the past week there were
several news stories on the Egypt unrest. I need to detect and group them. Is it possible
to do this with Mahout?
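
As a rough illustration of what grouping similar stories involves, the sketch below turns each document into a TF-IDF vector and puts a document into the first existing group whose representative it resembles closely enough by cosine similarity. This is plain Python with a simple single-pass threshold grouping, not Mahout code and not k-means. The function names, the `threshold` value of 0.2, and the sample headlines are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF; a term present in every document gets weight 0.
        vectors.append(
            {t: c * math.log((1 + n) / (1 + df[t])) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_similar(docs, threshold=0.2):
    """Greedy single-pass grouping; returns lists of document indices.

    The first index in each group acts as that group's representative.
    """
    vectors = tfidf_vectors(docs)
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Made-up sample headlines, for illustration only.
headlines = [
    "Protests in Egypt continue as crowds gather in Cairo",
    "Stock markets rally as tech shares climb",
    "Egypt protests grow and crowds in Cairo demand change",
    "Tech shares lead stock markets higher",
]
groups = group_similar(headlines)
print(groups)
```

The result depends on the order of the documents and on the threshold, so this is only a starting point for experimenting before moving to a proper clustering algorithm.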

Joyce
On Thursday 3 February 2011 at 7:07 PM, vineet yadav wrote: 
> Hi Joyce,
> Mahout uses clustering algorithms to extract top terms or topics from
> document sets. It basically offers three types of algorithm for keyword
> extraction:
> 1) Collocation extraction:
> https://cwiki.apache.org/confluence/display/MAHOUT/Collocations
> 2) Clustering algorithms: it supports clustering algorithms like k-means,
> fuzzy k-means, canopy, etc.
> 3) Latent Dirichlet Allocation:
> https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation
> Mahout uses simple unsupervised (clustering) algorithms for keyword
> extraction, whereas I think OpenCalais uses supervised and deep semantic
> approaches. I think you are looking for a supervised (classification)
> algorithm for keyphrase extraction. I suggest looking at KEA (
> http://www.nzdl.org/Kea/download.html) and maui-indexer (
> http://code.google.com/p/maui-indexer/)
> Thanks
> Vineet Yadav
> 
> On Thu, Feb 3, 2011 at 6:51 PM, Joyce Babu <joyce@joycebabu.com> wrote:
> 
> > Hi,
> > 
> > I am new to Java and to machine learning concepts. I was searching for a method
> > to extract keywords (like names of people, organizations, places, etc.) from
> > news stories, sorted by relevance. I found several web services like
> > OpenCalais that provide a similar service, but they don't detect most of my
> > terms. I have a list of approved keywords, and only need to detect terms from that
> > list.
> > 
> > I found out about machine learning and got interested in the concept. I
> > read somewhere that the classification feature of Mahout can be used for
> > detecting keywords by classifying terms as keywords and non-keywords. I have
> > been trying to learn Mahout for the past 30 hours, but haven't gotten
> > anywhere. There is no point spending more time learning it if Mahout is not
> > the right tool for my problem.
> > 
> > Can someone provide details on using Mahout for term extraction? Is it
> > possible to do this with little to medium knowledge of Java? Is it
> > overkill to use Mahout for this? Should I go for an NLP solution?
> > 
> > Thanks,
> > Joyce
> 
