mahout-user mailing list archives

From Lance Norskog
Subject Re: From example to job
Date Sat, 14 Jul 2012 04:46:43 GMT
A Mahout vector is not one format; it is a family of data structures
optimized for various tasks.
A Mahout vector file is a Hadoop sequencefile of zero or more entries,
each a pair of Writables: (Writable, VectorWritable).
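
As a rough illustration of that shape, here is a plain-Python model of
a vector file built from the example input line in the question. It
does not use Mahout or Hadoop; parse_line and vector_file are
hypothetical names, and a list of tuples stands in for the
sequencefile of (Writable, VectorWritable) entries.

```python
# Plain-Python model only: no Mahout or Hadoop classes are involved.
# A "vector file" is modeled as a list of (key, vector) pairs.

def parse_line(line):
    """Split 'name,t1,t2,...' into a (key, vector) pair."""
    name, *values = line.strip().split(",")
    return name, [float(v) for v in values]

lines = ["John Doe,1324,1233,2234,1267,1456,1745,1212"]

# Zero or more entries, each pairing a key with its vector of values.
vector_file = [parse_line(line) for line in lines]

for key, vector in vector_file:
    print(key, vector)
```

Here the name plays the role of the first Writable and the list of
times plays the role of the VectorWritable.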

Several programs require the first Writable (the key) to be an int row
number. There may be single-process programs which also require the
row numbers to be in sequence. There should not be any Hadoop-friendly
programs which require this.

The "rowid" you will notice referred to a lot is a pair of programs
that replace the Writable key with a unique integer, and save the
original keys out to a dictionary sequencefile of int->Writable.
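
The idea behind that key replacement can be sketched in plain Python.
This is not the Mahout code: assign_row_ids is a hypothetical helper,
lists of tuples stand in for the two sequencefiles, and numbering the
rows sequentially from 0 is an assumption made for the illustration.

```python
# Plain-Python sketch of the key-replacement idea; not Mahout's code.

def assign_row_ids(pairs):
    """Replace each key with a unique int and keep an int->key dictionary.

    Returns (matrix, dictionary), where matrix holds (row_id, vector)
    entries and dictionary holds (row_id, original_key) entries.
    """
    matrix = []
    dictionary = []
    # Assumption: ids are handed out sequentially starting at 0.
    for row_id, (key, vector) in enumerate(pairs):
        matrix.append((row_id, vector))
        dictionary.append((row_id, key))
    return matrix, dictionary

pairs = [
    ("John Doe", [1324.0, 1233.0]),
    ("Jane Roe", [1267.0, 1456.0]),
]
matrix, dictionary = assign_row_ids(pairs)

# Look up the original key for a row id after processing.
lookup = dict(dictionary)
print(matrix)
print(lookup[1])
```

The dictionary is what lets you map the integer row numbers in a
result back to the original keys.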

On Thu, Jul 12, 2012 at 12:30 PM, Robert Hall wrote:
> Greetings.
> I'm trying to jump from the examples in mahout to a practical job of my
> very own. First, I'm very new to mahout but I do have some experience with
> machine learning, clustering, and classifications.
> My goal: To get KMeans clusters of time-based use from structured data
> Example Input:
> John Doe,1324,1233,2234,1267,1456,1745,1212
> There's a name and a variable series of numbers that correspond to time in
> seconds to complete an operation. The times are pre-filtered > 1200 and
> built by date/time (pivoted into nameless columns) of the operation, but
> the date/time is not relevant to my goal.
> Can someone point me toward any resources that explain, not how to run an
> example, but how the examples were put together?
> If not a resource, how about a high-level description on what mahout is
> looking for and how it does, say a KMeans cluster analysis.
> Finally, can someone describe a mahout vector and vector file? A
> description plus the actual format of a vector row/file.
> --
> Robert Hall

Lance Norskog
