mahout-user mailing list archives

From John Conwell <j...@iamjohn.me>
Subject Re: Clustering or Classification?
Date Wed, 01 Aug 2012 18:10:08 GMT
Here is an article I ran across a few weeks ago that I think describes what
you're after (at least at a high level):
http://blog.getprismatic.com/blog/2012/4/17/clustering-related-stories.html


On Wed, Aug 1, 2012 at 10:08 AM, Salman Mahmood <salman.03@gmail.com> wrote:

> Hi all,
>
> I am stuck deciding whether to apply classification or clustering to the
> data set I have. The more I think about it, the more confused I get. Here's
> what I am confronted with.
>
> I have news documents (around 3000 and continuously increasing)
> containing news about companies, investment, stocks, the economy, quarterly
> income etc. My goal is to have the news sorted in such a way that I know
> which news item corresponds to which company. For example, for the news
> item "Apple launches new iphone", I need to associate the company Apple
> with it. A particular news item/document only contains 'title' and
> 'description', so I have to analyze the text in order to find out which
> company the news refers to. It could be multiple companies too.
>
> To solve this, I turned to Mahout.
>
> I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel'
> etc. as top terms in my clusters, and from there I would know that the news
> in a cluster corresponds to its cluster label, but things were a bit
> different. I got 'investment', 'stocks', 'correspondence', 'green energy',
> 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the
> top ones (which makes sense, as clustering algorithms look for common
> terms). There were some 'Apple' clusters, but the news items associated
> with them were very few. I thought maybe clustering is not for this kind of
> problem, as much of the company news goes into more general clusters
> (investment, profit) instead of the specific company cluster (Apple).
>
> I started reading about classification, which requires training data. The
> name was convincing too, as I actually want to 'classify' my news items into
> 'company names'. As I read on, I got the impression that the name
> classification is a bit deceiving and the technique is used more for
> prediction purposes than for classification. Another point of confusion
> was how I can prepare training data for news documents. Let's
> assume I have a list of companies that I am interested in. I write a
> program to produce training data for the classifier. The program will see
> if the news title or description contains the company name 'Apple'; if so,
> it's a news story about Apple. Is this how I can prepare training data? (Of
> course, I read that training data is actually a set of predictors and target
> variables.) If so, then why should I use Mahout classification in the first
> place? I should ditch Mahout and instead use this little program that I
> wrote for training data (which actually does the classification).
>
> You can see how confused I am about how to address this issue. Another
> thing that concerns me is whether it's possible to make a system so
> intelligent that, if the news says 'iphone sales at a record high' without
> using the word 'Apple', the system can classify it as news related to
> Apple.
>
> Thank you in advance for pointing me in the right direction.
>
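
For reference, the "little program" described in the quoted message could be sketched roughly as below. This is only an illustration in plain Python, not Mahout code: the company list, the alias sets (for example, treating 'iphone' as an alias for Apple) and the function name label_news are all hypothetical, and simple token matching like this will miss stories that use none of the listed terms.

```python
import re

# Hypothetical company list. The alias sets are illustrative assumptions,
# not something taken from the original message or from Mahout.
COMPANY_ALIASES = {
    "Apple": {"apple", "iphone"},
    "Google": {"google"},
    "Intel": {"intel"},
}


def label_news(title, description):
    """Return the sorted names of companies whose aliases occur in the text."""
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = set(re.findall(r"[a-z0-9]+", f"{title} {description}".lower()))
    # A company matches when at least one of its aliases is among the tokens.
    return sorted(
        name for name, aliases in COMPANY_ALIASES.items() if aliases & tokens
    )
```

Calling label_news("Apple launches new iphone", "") returns a list containing only "Apple", and a story that mentions several listed companies gets several labels. The output of such a labeler could serve as rough training labels, with the caveat that it encodes only what the alias lists already know.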



-- 

Thanks,
John C
