mahout-user mailing list archives

From parnab kumar <>
Subject Re: Help for Grouping similar items together. Clustering/Classification problem?
Date Mon, 21 Jul 2014 19:59:37 GMT

For only 1.5 million items, I don't think Mahout is required.
Any other machine learning software, such as Weka, should be enough.
Next, I see you have only 2 attributes: id and name. Unless these are
combined with additional features, no classifier or clustering algorithm
will work.

Try using some external knowledge, such as Wikipedia or even the first 10
result snippets from a standard search engine, to gather additional
attributes in the form of a description.
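
To illustrate why the extra description matters, here is a minimal sketch in
plain Python (no Mahout or Weka involved). It compares the two example titles
as bags of words, first on their own and then with a description appended.
The description strings are made up for illustration only; in practice they
would come from whatever external source you pick. The helper names
(tokenize, cosine) are likewise just names chosen for this sketch.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real titles would need better cleaning.
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# The two example rows from the CSV.
titles = {1: "Apple IPHONE 5", 2: "Samsung Galaxy S5"}

# Hypothetical descriptions, invented for this sketch; they stand in for text
# gathered from Wikipedia or search-engine snippets.
descriptions = {
    1: "smartphone with touchscreen and camera",
    2: "android smartphone with large touchscreen",
}

# Feature vectors built from the title alone.
title_only = {i: Counter(tokenize(t)) for i, t in titles.items()}

# Feature vectors built from the title plus the gathered description.
enriched = {
    i: Counter(tokenize(titles[i] + " " + descriptions[i])) for i in titles
}

sim_title_only = cosine(title_only[1], title_only[2])
sim_enriched = cosine(enriched[1], enriched[2])

print("title only:", sim_title_only)
print("with description:", sim_enriched)
```

With titles alone the two products share no tokens, so any distance-based
clustering sees them as unrelated. Once descriptions are appended, shared
words such as "smartphone" give the algorithm something to group on. A real
pipeline would likely also weight terms (for example with TF-IDF) and drop
stop words like "with".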


On Tue, Jul 22, 2014 at 1:13 AM, Andreas Spalas <> wrote:

> Hi,
> these days I am exploring the Mahout framework in order to solve a specific
> problem.
> The problem is that I have a csv file with 1.5 Million items - products
> with the following format:
> id, product_title
> 1, Apple IPHONE 5
> 2, Samsung Galaxy S5
> etc..
> and I would like to group the items-products together in terms of category
> so for example in the above case both products would be under "Technology"
> or "Smartphones" Category.
> I would like to know if this is possible to handle in Mahout and whether
> someone would choose clustering or classification way in order to solve
> such a problem.
> As I am currently studying "Mahout in Action", I saw that for the clustering
> case I have to transform my data into a SequenceFile and find a way of
> vectorizing it, and I don't really get whether this is applicable to my case
> at the moment. For the second case, classification, I understand that I have
> to provide some training data with a target variable (in my case "Category")
> in order to create a model for the classification system. I can extend
> my dataset with this extra info, but is it going to work?
> Can anyone give me some advice on how to handle this particular problem? Is
> it even possible to do it in Mahout? Any direction would be appreciated!
> Thanks a lot in advance.
