From: sharath jagannath <sharathjagannath@gmail.com>
To: user@mahout.apache.org
Date: Wed, 2 Feb 2011 20:15:34 -0800
Subject: Re: Need help: beginner

Thanks Dmitriy and Shige,

@Shige:
#1 - Thanks for sharing it.
#2 - Yeah, I guess I need to write my own converter. I need to use the
tag|rank pairs associated with the text, and the vector should be a
composition of all the tags.
#3 - I am not quite sure about this one. I want another part of my app to
invoke it; that is the part I am not getting.

Thanks,
Sharath

On Wed, Feb 2, 2011 at 7:06 PM, Shige Takeda wrote:

> Hi Sharath,
>
> Although I'm a novice too, let me try to answer:
>
> ==> #1. In my case, I package Mahout and all its dependencies into one
> jar (mahout-mcontents-app-1.0-SNAPSHOT-job.jar).
> Here are my Maven pom and assembly job file for your reference:
>
> https://github.com/smtakeda/mahout-mcontents/blob/master/pom.xml
> https://github.com/smtakeda/mahout-mcontents/blob/master/src/main/assembly/job.xml
>
> ==> #2. You may need to write a converter from your data to a vector.
> I'm not sure how tag/rank can be normalized along with the text data,
> though. Mahout's built-in seq2sparse converts text into term vectors
> using Lucene's StandardAnalyzer.
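>
> For the tag|rank part, here is a rough sketch of what a hand-rolled
> converter could look like. The input format ("context<TAB>tag|rank,..."),
> the dictionary handling, and the cardinality are assumptions for
> illustration, not anything Mahout gives you out of the box:
>
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.mahout.math.NamedVector;
> import org.apache.mahout.math.RandomAccessSparseVector;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.math.VectorWritable;
>
> public class TagRankVectorizer {
>   // Assigns each distinct tag a fixed dimension in the vector space.
>   private final Map<String, Integer> dictionary =
>       new HashMap<String, Integer>();
>
>   // Turns one line, "context<TAB>tag|rank,tag|rank,...", into a sparse
>   // vector whose entries are the ranks, named by the context string.
>   public NamedVector vectorize(String line, int cardinality) {
>     String[] parts = line.split("\t");
>     Vector vec = new RandomAccessSparseVector(cardinality);
>     for (String tagRank : parts[1].split(",")) {
>       String[] tr = tagRank.split("\\|");
>       Integer index = dictionary.get(tr[0]);
>       if (index == null) {
>         index = dictionary.size();
>         dictionary.put(tr[0], index);
>       }
>       vec.set(index, Double.parseDouble(tr[1]));
>     }
>     return new NamedVector(vec, parts[0]);
>   }
>
>   // Writes the vectors where the clustering jobs expect them:
>   // a SequenceFile of Text keys and VectorWritable values.
>   public void write(Iterable<String> lines, Path path, Configuration conf)
>       throws IOException {
>     FileSystem fs = FileSystem.get(conf);
>     SequenceFile.Writer writer = new SequenceFile.Writer(
>         fs, conf, path, Text.class, VectorWritable.class);
>     try {
>       for (String line : lines) {
>         NamedVector vec = vectorize(line, 10000); // cardinality: a guess
>         writer.append(new Text(vec.getName()), new VectorWritable(vec));
>       }
>     } finally {
>       writer.close();
>     }
>   }
> }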
>
> ==> #3. Using #1, I can run SparseVectorsFromSequenceFiles as follows:
>
> hadoop jar target/mahout-mcontents-app-1.0-SNAPSHOT-job.jar \
>   org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
>   -Dmapred.create.symlink=yes \
>   -Dmapred.cache.archives=igo-naist-jdic.zip#igo-naist-jdic \
>   -Dmapred.cache.files=stop_words.txt#stop_words.txt \
>   -i text_output -o seq_output4 \
>   -a com.mcontents.lucene.analysis.IgoNaistdicAnalyzer
>
> Why not use seq2sparse via the Mahout driver? Because I need a custom
> Analyzer class, com.mcontents.lucene.analysis.IgoNaistdicAnalyzer, to
> generate term vectors from Japanese text, and that class must be visible
> to the classloader that runs SparseVectorsFromSequenceFiles.
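>
> If your application needs to kick this off from Java rather than from
> the shell, calling the class's main() with the same flags should work.
> This is only a sketch: the paths are placeholders, and the exact
> argument handling may differ between trunk revisions:
>
> import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
>
> public class VectorizeFromMyApp {
>   public static void main(String[] args) throws Exception {
>     // The same flags "hadoop jar" would pass on the command line;
>     // input/output paths and the analyzer class are placeholders.
>     SparseVectorsFromSequenceFiles.main(new String[] {
>         "-i", "text_output",
>         "-o", "seq_output4",
>         "-a", "com.mcontents.lucene.analysis.IgoNaistdicAnalyzer"
>     });
>   }
> }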
>
> If you want to run your own application that depends on Mahout, you
> should be able to use a similar technique, like the sketch above.
>
> Please note I use TRUNK, which I recommend you use as well, as it has
> lots of improvements.
>
> Hope this helps,
> --
> Shige Takeda
> shige@mcontents.com
> smtakeda@gmail.com
>
> On Wed, Feb 2, 2011 at 5:10 PM, sharath jagannath
> <sharathjagannath@gmail.com> wrote:
>
> > With some effort, I am now able to write a simple KMeans cluster.
> > But it is writing all the data to the local folder again and not to
> > the Hadoop FS.
> > Again, I wanted to use Mahout as a jar and build my application on top
> > of it, which I could not. So for now I am adding classes to the
> > examples folder within the mahout-distribution.
> >
> > Any help/suggestion is much appreciated. I am kinda stuck with this.
> > It is so frustrating not being able to proceed.
> > I need suggestions on:
> > 1. Packaging Mahout as a jar / integrating it with the rest of the
> > application.
> > 2. How to create a Vector from my data; my training set will have a
> > context\tList.
> > 3. How to run the driver on Hadoop. I am setting environment variables
> > (in Eclipse) using: Properties -> Run/Debug Settings -> Environment ->
> > New.
> >
> > Code (came with MEAP):
> >
> > package org.apache.mahout.clustering.syntheticcontrol.kmeans;
> >
> > import java.io.File;
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.hadoop.io.Text;
> > import org.apache.mahout.clustering.WeightedVectorWritable;
> > import org.apache.mahout.clustering.kmeans.Cluster;
> > import org.apache.mahout.clustering.kmeans.KMeansDriver;
> > import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
> > import org.apache.mahout.math.RandomAccessSparseVector;
> > import org.apache.mahout.math.Vector;
> > import org.apache.mahout.math.VectorWritable;
> >
> > public class SimpleKMeansCluster {
> >
> >   public static final double[][] points = {
> >       {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
> >       {8, 8}, {9, 8}, {8, 9}, {9, 9}};
> >
> >   // Writes the input points as LongWritable/VectorWritable pairs.
> >   public static void writePointsToFile(List<Vector> points,
> >       String fileName, FileSystem fs, Configuration conf)
> >       throws IOException {
> >     Path path = new Path(fileName);
> >     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
> >         path, LongWritable.class, VectorWritable.class);
> >     long recNum = 0;
> >     VectorWritable vec = new VectorWritable();
> >     for (Vector point : points) {
> >       vec.set(point);
> >       writer.append(new LongWritable(recNum++), vec);
> >     }
> >     writer.close();
> >   }
> >
> >   public static List<Vector> getPoints(double[][] raw) {
> >     List<Vector> points = new ArrayList<Vector>();
> >     for (int i = 0; i < raw.length; i++) {
> >       double[] fr = raw[i];
> >       Vector vec = new RandomAccessSparseVector(fr.length);
> >       vec.assign(fr);
> >       points.add(vec);
> >     }
> >     return points;
> >   }
> >
> >   public static void main(String[] args) throws Exception {
> >     int k = 2;
> >     List<Vector> vectors = getPoints(points);
> >     File testData = new File("testdata1");
> >     if (!testData.exists()) {
> >       testData.mkdir();
> >     }
> >     testData = new File("testdata1/points");
> >     if (!testData.exists()) {
> >       testData.mkdir();
> >     }
> >     Configuration conf = new Configuration();
> >     FileSystem fs = FileSystem.get(conf);
> >     writePointsToFile(vectors, "testdata1/points/file1", fs, conf);
> >
> >     // Seed the initial clusters with the first k points.
> >     Path path = new Path("testdata1/clusters/part-00000");
> >     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
> >         path, Text.class, Cluster.class);
> >     for (int i = 0; i < k; i++) {
> >       Vector vec = vectors.get(i);
> >       Cluster cluster = new Cluster(vec, i,
> >           new EuclideanDistanceMeasure());
> >       writer.append(new Text(cluster.getIdentifier()), cluster);
> >     }
> >     writer.close();
> >
> >     KMeansDriver.run(conf, new Path("testdata1/points"),
> >         new Path("testdata1/clusters"), new Path("output"),
> >         new EuclideanDistanceMeasure(), 0.001, 10, true, false);
> >
> >     SequenceFile.Reader reader = new SequenceFile.Reader(fs,
> >         new Path("output/" + Cluster.CLUSTERED_POINTS_DIR
> >             + "/part-m-00000"), conf);
> >     IntWritable key = new IntWritable();
> >     WeightedVectorWritable value = new WeightedVectorWritable();
> >     while (reader.next(key, value)) {
> >       System.out.println(value.toString() + " belongs to cluster "
> >           + key.toString());
> >     }
> >     reader.close();
> >   }
> > }
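> >
> > As an aside on the local-folder problem above: with no core-site.xml
> > on the classpath, FileSystem.get(conf) silently falls back to the
> > local file system, which would explain the output landing in local
> > folders. A minimal sketch of the fix, assuming a namenode at a
> > placeholder address (note also that java.io.File.mkdir() always
> > creates local directories; fs.mkdirs(new Path(...)) is the HDFS
> > equivalent). In main(), instead of relying on the defaults:
> >
> >     Configuration conf = new Configuration();
> >     // Point the default FS at the namenode explicitly so that
> >     // FileSystem.get() returns an HDFS client rather than a
> >     // LocalFileSystem. The host:port is a placeholder for your
> >     // cluster's namenode.
> >     conf.set("fs.default.name", "hdfs://localhost:9000");
> >     FileSystem fs = FileSystem.get(conf);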
> >
> > Thanks,
> > Sharath
> >
> > On Wed, Feb 2, 2011 at 2:43 PM, sharath jagannath
> > <sharathjagannath@gmail.com> wrote:
> >
> > > I did not pass any arguments. I used the default ones:
> > >
> > > log.info("Running with default arguments");
> > > Path output = new Path("output");
> > > HadoopUtil.overwriteOutput(output);
> > > new Job().run(new Configuration(), new Path("testdata"), output,
> > >     new EuclideanDistanceMeasure(), 80, 55, 0.5, 10);