mahout-user mailing list archives

From Eduard Gamonal <>
Subject Re: getting started with mahout and kmeans
Date Tue, 27 Nov 2012 16:45:25 GMT
I'll start with a small dataset, ~1000 rows (data points?) of about 20
features each.
However, I'd like to use a much bigger dataset later; in that case,
bash would be too slow, wouldn't it?
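
A minimal sketch of the column-removal step discussed in this thread, assuming comma-separated numeric rows. The names ColumnSelector and selectFeatures are hypothetical, and no Mahout or Hadoop API is used here; turning the resulting array into a Mahout vector and writing it out for k-means would still be a separate step.

```java
import java.util.Arrays;

public class ColumnSelector {
    // Parse one delimited row and keep only the requested 0-based columns.
    // Note: the delimiter is interpreted as a regular expression by String.split.
    static double[] selectFeatures(String line, String delimiter, int[] keep) {
        String[] fields = line.split(delimiter);
        double[] features = new double[keep.length];
        for (int i = 0; i < keep.length; i++) {
            // trim() tolerates spaces around the delimiter
            features[i] = Double.parseDouble(fields[keep[i]].trim());
        }
        return features;
    }

    public static void main(String[] args) {
        // Hypothetical row with four columns; keep the second and fourth.
        double[] v = selectFeatures("42, 0.5, 17, 3.25", ",", new int[] {1, 3});
        System.out.println(Arrays.toString(v));
    }
}
```

The same per-row function could be called from a plain loop for ~1000 rows or from a map task for a larger dataset, so the choice between bash and a program mostly affects how the rows are fed in.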

On Tue, Nov 27, 2012 at 12:54 AM, Ted Dunning <> wrote:
> How many data points are you clustering?  How many dimensions?
> On Mon, Nov 26, 2012 at 2:33 PM, Eduard Gamonal <> wrote:
>> Hi,
>> I'm doing an MSc at Northeastern and I'm working on analyzing some US
>> election polls with k-means.
>> I'm a beginner with both Mahout and Hadoop. I've been reading the docs
>> but I'd still appreciate some orientation on these questions:
>> * I can transform my input data into vectors and run k-means using the
>> command line [1]. I downloaded Hadoop (1.0.4, running on a real
>> cluster) and wrote a program for it. Then I downloaded Mahout and
>> saw that it includes a Hadoop jar file (0.20, single node:
>> M2_REPO/org/apache/hadoop/hadoop-core/
>> ). If I point HADOOP_HOME to my Hadoop installation, will Mahout use
>> it? I set HADOOP_HOME in hadoop/conf/, though.
>> * I might need to remove some columns from my data set. With Hadoop I
>> could write a program to tokenize the input and create the data
>> structures I need, and then call kmeansdriver. Alternatively, I can
>> use bash to remove the columns and run Mahout from the command line.
>> Should I write a program instead?
>> * How do I write a program for Mahout 0.7 (and Hadoop 1.x), from scratch?
>> I need to transform the dataset: Vectors should be created only with
>> the features I want k-means to consider to cluster my data. Then I can
>> call kmeansdriver.  I think I can do both using the explanation of
>> Should the main class extend any other?
>> How do I deploy it in a cluster with Hadoop?
>> * It is my understanding that Mahout is a framework. I read the code
>> example in org.apache.mahout.clustering.syntheticcontrol.kmeans. It
>> extends AbstractJob. I made a new project in Eclipse and copied the
>> example. My goal was to run it. I tried "java -jar myjar.jar" and
>> passing my new jar as a parameter to hadoop. What's the correct way of
>> running a program for Mahout?
>> Thanks
>> [1]
