mahout-user mailing list archives

From sharath jagannath <sharathjagann...@gmail.com>
Subject Re: Need help: beginner
Date Thu, 03 Feb 2011 04:15:34 GMT
Thanks Dmitriy and Shige,

@Shige:
#1 - Thanks for sharing it.
#2 - Yeah, I guess I need to write my own converter. I need to use the
Tags|Ranks associated with the text, and the vector should be a composition
of all the tags (rough sketch below).
#3 - I am still not quite sure about this one. I want another part of my app
to invoke it, and that is the part I am not getting.
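
Something along these lines is what I have in mind for #2. This is only a
sketch: the line format (context\tTag|Rank,Tag|Rank,...) and the tag-to-index
dictionary are my own assumptions, not anything Mahout provides.

import java.util.HashMap;
import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TagRankVectorizer {

  // Hypothetical dictionary mapping each tag to a fixed vector dimension.
  private final Map<String, Integer> tagIndex = new HashMap<String, Integer>();

  private int indexOf(String tag) {
    Integer idx = tagIndex.get(tag);
    if (idx == null) {
      idx = tagIndex.size();
      tagIndex.put(tag, idx);
    }
    return idx;
  }

  // Turns one line such as "someContext\tjava|3,hadoop|7" into a sparse vector
  // whose weight at each tag's dimension is that tag's rank. The cardinality
  // must be at least the number of distinct tags in the whole training set.
  public Vector vectorize(String line, int cardinality) {
    String[] contextAndTags = line.split("\t", 2);
    Vector vec = new RandomAccessSparseVector(cardinality);
    for (String pair : contextAndTags[1].split(",")) {
      String[] tagAndRank = pair.split("\\|");
      vec.set(indexOf(tagAndRank[0]), Double.parseDouble(tagAndRank[1]));
    }
    return vec;
  }
}

The resulting vectors could then be wrapped in VectorWritable and appended to
a SequenceFile the same way writePointsToFile() does in the code further down
this thread.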

Thanks,
Sharath

On Wed, Feb 2, 2011 at 7:06 PM, Shige Takeda <shige@mcontents.com> wrote:

> Hi Sharath,
>
> Although I'm a novice too, let me try to answer:
>
> ==> #1. In my case, I package Mahout and all dependencies into one jar
> (mahout-mcontents-app-1.0-SNAPSHOT-job.jar).
> Here is my Maven pom and assembly job file for your reference:
>
> https://github.com/smtakeda/mahout-mcontents/blob/master/pom.xml
>
> https://github.com/smtakeda/mahout-mcontents/blob/master/src/main/assembly/job.xml
>
> ==> #2. You may need to write a converter from your data to a vector. I'm
> not sure how tag/rank can be normalized along with text data, though.
> Mahout's built-in seq2sparse generates term vectors from text using
> Lucene's StandardAnalyzer.
>
> ==> #3. Using #1, I can run SparseVectorsFromSequenceFiles as follows:
> hadoop jar target/mahout-mcontents-app-1.0-SNAPSHOT-job.jar \
>   org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
>   -Dmapred.create.symlink=yes \
>   -Dmapred.cache.archives=igo-naist-jdic.zip#igo-naist-jdic \
>   -Dmapred.cache.files=stop_words.txt#stop_words.txt \
>   -i text_output -o seq_output4 \
>   -a com.mcontents.lucene.analysis.IgoNaistdicAnalyzer
>
> Why not use seq2sparse via the Mahout driver? Because I need a custom
> Analyzer class, com.mcontents.lucene.analysis.IgoNaistdicAnalyzer, to
> generate term vectors from Japanese text, and that class must be loadable
> by the classloader that runs SparseVectorsFromSequenceFiles.
>
> If you want to run your own application that depends on Mahout, you should
> be able to use a similar technique (rough sketch below).
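>
> The overall shape, with made-up class names and only as a sketch, would be
> a plain Hadoop Tool driver packaged into the job jar:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.conf.Configured;
> import org.apache.hadoop.util.Tool;
> import org.apache.hadoop.util.ToolRunner;
>
> // Hypothetical driver class bundled inside your *-job.jar.
> public class MyMahoutDriver extends Configured implements Tool {
>
>   @Override
>   public int run(String[] args) throws Exception {
>     Configuration conf = getConf();
>     // Call into Mahout from here, e.g. KMeansDriver.run(conf, ...),
>     // or kick off your own vectorization step.
>     return 0;
>   }
>
>   public static void main(String[] args) throws Exception {
>     System.exit(ToolRunner.run(new Configuration(), new MyMahoutDriver(), args));
>   }
> }
>
> You would then launch it the same way as above:
> hadoop jar your-app-job.jar MyMahoutDriver <your args>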
>
> Please note that I use trunk, which I recommend you use as well, since it
> has lots of improvements.
>
> Hope this helps,
> --
> Shige Takeda
> shige@mcontents.com
> smtakeda@gmail.com
>
> On Wed, Feb 2, 2011 at 5:10 PM, sharath jagannath <
> sharathjagannath@gmail.com> wrote:
>
> > With some effort, I am now able to write a simple KMeans clustering job.
> > But it is writing all the data to the local folder again and not to the
> > Hadoop FS (see the note after the code below). I also wanted to use Mahout
> > as a jar and build my application against it, which I could not get to
> > work, so for now I am adding my classes to the examples folder within the
> > Mahout distribution.
> >
> > Any help/suggestion is much appreciated. I am kind of stuck with this, and
> > it is frustrating not being able to proceed. I need suggestions on:
> > 1. Packaging Mahout as a jar, and how to integrate it with the rest of the
> > application.
> > 2. How to create a Vector from my data; each line of my training set will
> > be a context\tList<tags|rank>.
> > 3. How to run the driver on Hadoop. I am setting environment variables (in
> > Eclipse) using Properties -> Run/Debug Settings -> Environment -> New.
> >
> > Code: (Came with MEAP)
> >
> > package org.apache.mahout.clustering.syntheticcontrol.kmeans;
> >
> > import java.io.File;
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.hadoop.io.Text;
> > import org.apache.mahout.clustering.WeightedVectorWritable;
> > import org.apache.mahout.clustering.kmeans.Cluster;
> > import org.apache.mahout.clustering.kmeans.KMeansClusterer;
> > import org.apache.mahout.clustering.kmeans.KMeansDriver;
> > import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
> > import org.apache.mahout.math.RandomAccessSparseVector;
> > import org.apache.mahout.math.Vector;
> > import org.apache.mahout.math.VectorWritable;
> >
> > public class SimpleKMeansCluster {
> >
> >   public static final double[][] points = {
> >       {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
> >       {8, 8}, {9, 8}, {8, 9}, {9, 9}};
> >
> >   // Write the input vectors to a SequenceFile<LongWritable, VectorWritable>.
> >   public static void writePointsToFile(List<Vector> points, String fileName,
> >       FileSystem fs, Configuration conf) throws IOException {
> >     Path path = new Path(fileName);
> >     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
> >         LongWritable.class, VectorWritable.class);
> >     long recNum = 0;
> >     VectorWritable vec = new VectorWritable();
> >     for (Vector point : points) {
> >       vec.set(point);
> >       writer.append(new LongWritable(recNum++), vec);
> >     }
> >     writer.close();
> >   }
> >
> >   public static List<Vector> getPoints(double[][] raw) {
> >     List<Vector> points = new ArrayList<Vector>();
> >     for (int i = 0; i < raw.length; i++) {
> >       double[] fr = raw[i];
> >       Vector vec = new RandomAccessSparseVector(fr.length);
> >       vec.assign(fr);
> >       points.add(vec);
> >     }
> >     return points;
> >   }
> >
> >   public static void main(String[] args) throws Exception {
> >     int k = 2;
> >     List<Vector> vectors = getPoints(points);
> >
> >     File testData = new File("testdata1");
> >     if (!testData.exists()) {
> >       testData.mkdir();
> >     }
> >     testData = new File("testdata1/points");
> >     if (!testData.exists()) {
> >       testData.mkdir();
> >     }
> >
> >     Configuration conf = new Configuration();
> >     FileSystem fs = FileSystem.get(conf);
> >     writePointsToFile(vectors, "testdata1/points/file1", fs, conf);
> >
> >     // Write the initial cluster centers, one per chosen seed point.
> >     Path path = new Path("testdata1/clusters/part-00000");
> >     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
> >         Text.class, Cluster.class);
> >     for (int i = 0; i < k; i++) {
> >       Vector vec = vectors.get(i);
> >       Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
> >       writer.append(new Text(cluster.getIdentifier()), cluster);
> >     }
> >     writer.close();
> >
> >     // Run k-means, then read back and print the clustered points.
> >     KMeansDriver.run(conf, new Path("testdata1/points"),
> >         new Path("testdata1/clusters"), new Path("output"),
> >         new EuclideanDistanceMeasure(), 0.001, 10, true, false);
> >
> >     SequenceFile.Reader reader = new SequenceFile.Reader(fs,
> >         new Path("output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
> >     IntWritable key = new IntWritable();
> >     WeightedVectorWritable value = new WeightedVectorWritable();
> >     while (reader.next(key, value)) {
> >       System.out.println(value.toString() + " belongs to cluster " + key.toString());
> >     }
> >     reader.close();
> >   }
> > }
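> >
> > A guess on why everything lands in a local folder: FileSystem.get(conf)
> > with a plain new Configuration() resolves fs.default.name to the local
> > file system unless the Hadoop *-site.xml files are on the classpath. A
> > possible tweak (the NameNode URI here is just a made-up example) would be:
> >
> >   Configuration conf = new Configuration();
> >   // Point at HDFS explicitly; normally this value comes from core-site.xml
> >   // found via HADOOP_CONF_DIR on the classpath.
> >   conf.set("fs.default.name", "hdfs://localhost:9000");
> >   FileSystem fs = FileSystem.get(conf);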
> >
> >
> > Thanks,
> > Sharath
> >
> >
> > On Wed, Feb 2, 2011 at 2:43 PM, sharath jagannath <
> > sharathjagannath@gmail.com> wrote:
> >
> > > I did not pass any arguments; I used the defaults:
> > >
> > >       log.info("Running with default arguments");
> > >
> > >       Path output = new Path("output");
> > >
> > >       HadoopUtil.overwriteOutput(output);
> > >
> > >       new Job().run(new Configuration(), new Path("testdata"), output,
> > >           new EuclideanDistanceMeasure(), 80, 55, 0.5, 10);
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Shige Takeda
> shige@mcontents.com
>
