Subject: Re: Need help: beginner
From: Shige Takeda <smtakeda@gmail.com>
To: user@mahout.apache.org
Date: Wed, 2 Feb 2011 19:06:31 -0800

Hi Sharath,

Although I'm a novice too, let me try to answer:

==> #1. In my case, I package Mahout and all of its dependencies into one
jar (mahout-mcontents-app-1.0-SNAPSHOT-job.jar). Here are my Maven pom and
assembly job file for your reference:

https://github.com/smtakeda/mahout-mcontents/blob/master/pom.xml
https://github.com/smtakeda/mahout-mcontents/blob/master/src/main/assembly/job.xml

==> #2. You may need to write a converter from your data to a vector. I'm
not sure how tag/rank can be normalized along with text data, though.
Mahout's built-in seq2sparse generates term vectors from text using
Lucene's StandardAnalyzer. A rough sketch of such a converter is below.
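Just to illustrate the idea (this is an untested sketch of my own, not
Mahout code; the class name RecordVectorizer and the per-record layout are
made up): a converter can build one RandomAccessSparseVector per record and
append it to a SequenceFile of Text -> VectorWritable, a layout Mahout's
clustering jobs can consume.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class RecordVectorizer {

  // Turns one record (a context string plus numeric feature weights)
  // into a NamedVector and appends it to the given SequenceFile writer.
  public static void writeVector(String context, List<Double> weights,
      SequenceFile.Writer writer) throws IOException {
    Vector vec = new RandomAccessSparseVector(weights.size());
    for (int i = 0; i < weights.size(); i++) {
      vec.set(i, weights.get(i)); // one dimension per feature
    }
    // NamedVector keeps the context string attached to the vector
    writer.append(new Text(context),
        new VectorWritable(new NamedVector(vec, context)));
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        new Path("vectors/part-00000"), Text.class, VectorWritable.class);
    writeVector("some-context", Arrays.asList(1.0, 0.5, 2.0), writer);
    writer.close();
  }
}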
==> #3. Using #1, I can run SparseVectorsFromSequenceFiles as follows:

hadoop jar target/mahout-mcontents-app-1.0-SNAPSHOT-job.jar \
  org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
  -Dmapred.create.symlink=yes \
  -Dmapred.cache.archives=igo-naist-jdic.zip#igo-naist-jdic \
  -Dmapred.cache.files=stop_words.txt#stop_words.txt \
  -i text_output -o seq_output4 \
  -a com.mcontents.lucene.analysis.IgoNaistdicAnalyzer

Why not use seq2sparse via the Mahout driver? Because I need a custom
Analyzer class, com.mcontents.lucene.analysis.IgoNaistdicAnalyzer, to
generate term vectors from Japanese text, and that class must be visible to
the classloader running SparseVectorsFromSequenceFiles. If you want to run
your own application that depends on Mahout, you should be able to use a
similar technique; see the sketch below.
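For example (an illustrative sketch only, not code from my project; the
class name MyClusteringDriver is made up, and I haven't run this exact
code), a driver packaged into such a job jar can implement Hadoop's Tool
interface so that ToolRunner moves generic options like the -D flags above
into the Configuration before run() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class MyClusteringDriver extends Configured implements Tool {

  // args: <input vectors> <initial clusters> <output>
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // generic -D options already applied
    KMeansDriver.run(conf, new Path(args[0]), new Path(args[1]),
        new Path(args[2]), new EuclideanDistanceMeasure(),
        0.001, 10, true, false);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(
        ToolRunner.run(new Configuration(), new MyClusteringDriver(), args));
  }
}

You would then launch it the same way, e.g. (jar name is a placeholder):
hadoop jar target/your-app-job.jar MyClusteringDriver \
  testdata1/points testdata1/clusters output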
Please note that I use trunk, which I recommend you use as well, since it
has lots of improvements.

Hope this helps,
--
Shige Takeda
shige@mcontents.com
smtakeda@gmail.com

On Wed, Feb 2, 2011 at 5:10 PM, sharath jagannath
<sharathjagannath@gmail.com> wrote:

> With some effort, I am now able to write a simple KMeans clusterer,
> but it is writing all the data to the local folder again and not to the
> Hadoop FS.
> Again, I wanted to use Mahout as a jar and build my application on top of
> it, which I could not, so for now I am adding classes to the examples
> folder within the Mahout distribution.
>
> Any help/suggestion is much appreciated. I am kind of stuck with this,
> and it is so frustrating not being able to proceed.
> I need suggestions on:
> 1. Packaging Mahout as a jar / integrating it with the rest of the
> application.
> 2. How to create a Vector from my data; my training set will have a
> context\tList.
> 3. How to run the driver on Hadoop. I am setting environment variables
> (in Eclipse) using: Properties -> Run/Debug Settings -> Environment -> New.
>
> Code (came with MEAP):
>
> package org.apache.mahout.clustering.syntheticcontrol.kmeans;
>
> import java.io.File;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.mahout.clustering.WeightedVectorWritable;
> import org.apache.mahout.clustering.kmeans.Cluster;
> import org.apache.mahout.clustering.kmeans.KMeansDriver;
> import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
> import org.apache.mahout.math.RandomAccessSparseVector;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.math.VectorWritable;
>
> public class SimpleKMeansCluster {
>
>   public static final double[][] points = {
>       {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
>       {8, 8}, {9, 8}, {8, 9}, {9, 9}};
>
>   public static void writePointsToFile(List<Vector> points, String fileName,
>       FileSystem fs, Configuration conf) throws IOException {
>     Path path = new Path(fileName);
>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
>         LongWritable.class, VectorWritable.class);
>     long recNum = 0;
>     VectorWritable vec = new VectorWritable();
>     for (Vector point : points) {
>       vec.set(point);
>       writer.append(new LongWritable(recNum++), vec);
>     }
>     writer.close();
>   }
>
>   public static List<Vector> getPoints(double[][] raw) {
>     List<Vector> points = new ArrayList<Vector>();
>     for (int i = 0; i < raw.length; i++) {
>       double[] fr = raw[i];
>       Vector vec = new RandomAccessSparseVector(fr.length);
>       vec.assign(fr);
>       points.add(vec);
>     }
>     return points;
>   }
>
>   public static void main(String[] args) throws Exception {
>     int k = 2;
>     List<Vector> vectors = getPoints(points);
>     File testData = new File("testdata1");
>     if (!testData.exists()) {
>       testData.mkdir();
>     }
>     testData = new File("testdata1/points");
>     if (!testData.exists()) {
>       testData.mkdir();
>     }
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     writePointsToFile(vectors, "testdata1/points/file1", fs, conf);
>     Path path = new Path("testdata1/clusters/part-00000");
>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
>         Text.class, Cluster.class);
>     for (int i = 0; i < k; i++) {
>       Vector vec = vectors.get(i);
>       Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
>       writer.append(new Text(cluster.getIdentifier()), cluster);
>     }
>     writer.close();
>
>     KMeansDriver.run(conf, new Path("testdata1/points"),
>         new Path("testdata1/clusters"), new Path("output"),
>         new EuclideanDistanceMeasure(), 0.001, 10, true, false);
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs,
>         new Path("output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"),
>         conf);
>     IntWritable key = new IntWritable();
>     WeightedVectorWritable value = new WeightedVectorWritable();
>     while (reader.next(key, value)) {
>       System.out.println(value.toString() + " belongs to cluster "
>           + key.toString());
>     }
>     reader.close();
>   }
> }
>
> Thanks,
> Sharath
>
> On Wed, Feb 2, 2011 at 2:43 PM, sharath jagannath
> <sharathjagannath@gmail.com> wrote:
>
> > I did not pass any argument. I used the default one:
> >
> >   log.info("Running with default arguments");
> >   Path output = new Path("output");
> >   HadoopUtil.overwriteOutput(output);
> >   new Job().run(new Configuration(), new Path("testdata"), output,
> >       new EuclideanDistanceMeasure(), 80, 55, 0.5, 10);
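P.S. One guess about the output landing in a local folder instead of HDFS
(I haven't verified this against your setup): when you launch from Eclipse,
new Configuration() typically cannot see your cluster's core-site.xml, so
fs.default.name falls back to file:/// and FileSystem.get(conf) returns the
local file system. Something like the following forces HDFS (the namenode
URI below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Without core-site.xml on the classpath, fs.default.name falls back
    // to file:///, which is why everything lands on the local disk.
    conf.set("fs.default.name", "hdfs://localhost:9000"); // placeholder URI
    FileSystem fs = FileSystem.get(conf);
    // Also note that java.io.File.mkdir() always creates local directories;
    // use the FileSystem API to create them on HDFS instead:
    fs.mkdirs(new Path("testdata1/points"));
    System.out.println("Working on: " + fs.getUri());
  }
}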