Subject: Re: Need help: beginner
From: Shige Takeda <smtakeda@gmail.com>
To: user@mahout.apache.org
Date: Wed, 2 Feb 2011 19:06:31 -0800

Hi Sharath,

Although I'm a novice too, let me try to answer:

==> #1. In my case, I package Mahout and all of its dependencies into one
jar (mahout-mcontents-app-1.0-SNAPSHOT-job.jar). Here are my Maven pom and
assembly job file for your reference:

https://github.com/smtakeda/mahout-mcontents/blob/master/pom.xml
https://github.com/smtakeda/mahout-mcontents/blob/master/src/main/assembly/job.xml

==> #2. You may need to write a converter from your data to a vector. I'm
not sure how tag/rank can be normalized along with text data, though.
Mahout's built-in seq2sparse generates term vectors from text using
Lucene's StandardAnalyzer. A rough sketch of such a converter is below.
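Just to illustrate the idea (this is an untested sketch of my own, not
Mahout code; the class name RecordVectorizer and the per-record layout are
made up): a converter can build one RandomAccessSparseVector per record and
append it to a SequenceFile of Text -> VectorWritable, a layout Mahout's
clustering jobs can consume.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class RecordVectorizer {

  // Turns one record (a context string plus numeric feature weights)
  // into a NamedVector and appends it to the given SequenceFile writer.
  public static void writeVector(String context, List<Double> weights,
      SequenceFile.Writer writer) throws IOException {
    Vector vec = new RandomAccessSparseVector(weights.size());
    for (int i = 0; i < weights.size(); i++) {
      vec.set(i, weights.get(i)); // one dimension per feature
    }
    // NamedVector keeps the context string attached to the vector
    writer.append(new Text(context),
        new VectorWritable(new NamedVector(vec, context)));
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        new Path("vectors/part-00000"), Text.class, VectorWritable.class);
    writeVector("some-context", Arrays.asList(1.0, 0.5, 2.0), writer);
    writer.close();
  }
}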
==> #3. Using #1, I can run SparseVectorsFromSequenceFiles as follows:

hadoop jar target/mahout-mcontents-app-1.0-SNAPSHOT-job.jar \
  org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
  -Dmapred.create.symlink=yes \
  -Dmapred.cache.archives=igo-naist-jdic.zip#igo-naist-jdic \
  -Dmapred.cache.files=stop_words.txt#stop_words.txt \
  -i text_output -o seq_output4 \
  -a com.mcontents.lucene.analysis.IgoNaistdicAnalyzer

Why not use seq2sparse via the Mahout driver? Because I need a custom
Analyzer class, com.mcontents.lucene.analysis.IgoNaistdicAnalyzer, to
generate term vectors from Japanese text, and that class must be visible to
the classloader running SparseVectorsFromSequenceFiles. If you want to run
your own application that depends on Mahout, you should be able to use a
similar technique; see the sketch below.
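For example (an illustrative sketch only, not code from my project; the
class name MyClusteringDriver is made up, and I haven't run this exact
code), a driver packaged into such a job jar can implement Hadoop's Tool
interface so that ToolRunner moves generic options like the -D flags above
into the Configuration before run() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class MyClusteringDriver extends Configured implements Tool {

  // args: <input vectors> <initial clusters> <output>
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // generic -D options already applied
    KMeansDriver.run(conf, new Path(args[0]), new Path(args[1]),
        new Path(args[2]), new EuclideanDistanceMeasure(),
        0.001, 10, true, false);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(
        ToolRunner.run(new Configuration(), new MyClusteringDriver(), args));
  }
}

You would then launch it the same way, e.g. (jar name is a placeholder):
hadoop jar target/your-app-job.jar MyClusteringDriver \
  testdata1/points testdata1/clusters output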
Please note that I use trunk, which I recommend you use as well, since it
has lots of improvements.

Hope this helps,
--
Shige Takeda
shige@mcontents.com
smtakeda@gmail.com

On Wed, Feb 2, 2011 at 5:10 PM, sharath jagannath
<sharathjagannath@gmail.com> wrote:

> With some effort, I am now able to write a simple KMeans clusterer,
> but it is writing all the data to the local folder again and not to the
> Hadoop FS.
> Again, I wanted to use Mahout as a jar and build my application on top of
> it, which I could not, so for now I am adding classes to the examples
> folder within the Mahout distribution.
>
> Any help/suggestion is much appreciated. I am kind of stuck with this,
> and it is so frustrating not being able to proceed.
> I need suggestions on:
> 1. Packaging Mahout as a jar / integrating it with the rest of the
> application.
> 2. How to create a Vector from my data; my training set will have a
> context\tList.
> 3. How to run the driver on Hadoop. I am setting environment variables
> (in Eclipse) using: Properties -> Run/Debug Settings -> Environment -> New.
>
> Code (came with MEAP):
>
> package org.apache.mahout.clustering.syntheticcontrol.kmeans;
>
> import java.io.File;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.mahout.clustering.WeightedVectorWritable;
> import org.apache.mahout.clustering.kmeans.Cluster;
> import org.apache.mahout.clustering.kmeans.KMeansDriver;
> import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
> import org.apache.mahout.math.RandomAccessSparseVector;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.math.VectorWritable;
>
> public class SimpleKMeansCluster {
>
>   public static final double[][] points = {
>       {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
>       {8, 8}, {9, 8}, {8, 9}, {9, 9}};
>
>   public static void writePointsToFile(List<Vector> points, String fileName,
>       FileSystem fs, Configuration conf) throws IOException {
>     Path path = new Path(fileName);
>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
>         LongWritable.class, VectorWritable.class);
>     long recNum = 0;
>     VectorWritable vec = new VectorWritable();
>     for (Vector point : points) {
>       vec.set(point);
>       writer.append(new LongWritable(recNum++), vec);
>     }
>     writer.close();
>   }
>
>   public static List<Vector> getPoints(double[][] raw) {
>     List<Vector> points = new ArrayList<Vector>();
>     for (int i = 0; i < raw.length; i++) {
>       double[] fr = raw[i];
>       Vector vec = new RandomAccessSparseVector(fr.length);
>       vec.assign(fr);
>       points.add(vec);
>     }
>     return points;
>   }
>
>   public static void main(String[] args) throws Exception {
>     int k = 2;
>     List<Vector> vectors = getPoints(points);
>     File testData = new File("testdata1");
>     if (!testData.exists()) {
>       testData.mkdir();
>     }
>     testData = new File("testdata1/points");
>     if (!testData.exists()) {
>       testData.mkdir();
>     }
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     writePointsToFile(vectors, "testdata1/points/file1", fs, conf);
>     Path path = new Path("testdata1/clusters/part-00000");
>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
>         Text.class, Cluster.class);
>     for (int i = 0; i < k; i++) {
>       Vector vec = vectors.get(i);
>       Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
>       writer.append(new Text(cluster.getIdentifier()), cluster);
>     }
>     writer.close();
>
>     KMeansDriver.run(conf, new Path("testdata1/points"),
>         new Path("testdata1/clusters"), new Path("output"),
>         new EuclideanDistanceMeasure(), 0.001, 10, true, false);
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs,
>         new Path("output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"),
>         conf);
>     IntWritable key = new IntWritable();
>     WeightedVectorWritable value = new WeightedVectorWritable();
>     while (reader.next(key, value)) {
>       System.out.println(value.toString() + " belongs to cluster "
>           + key.toString());
>     }
>     reader.close();
>   }
> }
>
> Thanks,
> Sharath
>
> On Wed, Feb 2, 2011 at 2:43 PM, sharath jagannath
> <sharathjagannath@gmail.com> wrote:
>
> > I did not pass any argument. I used the default one:
> >
> >   log.info("Running with default arguments");
> >   Path output = new Path("output");
> >   HadoopUtil.overwriteOutput(output);
> >   new Job().run(new Configuration(), new Path("testdata"), output,
> >       new EuclideanDistanceMeasure(), 80, 55, 0.5, 10);
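P.S. One guess about the output landing in a local folder instead of HDFS
(I haven't verified this against your setup): when you launch from Eclipse,
new Configuration() typically cannot see your cluster's core-site.xml, so
fs.default.name falls back to file:/// and FileSystem.get(conf) returns the
local file system. Something like the following forces HDFS (the namenode
URI below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Without core-site.xml on the classpath, fs.default.name falls back
    // to file:///, which is why everything lands on the local disk.
    conf.set("fs.default.name", "hdfs://localhost:9000"); // placeholder URI
    FileSystem fs = FileSystem.get(conf);
    // Also note that java.io.File.mkdir() always creates local directories;
    // use the FileSystem API to create them on HDFS instead:
    fs.mkdirs(new Path("testdata1/points"));
    System.out.println("Working on: " + fs.getUri());
  }
}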