mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Need help: beginner
Date Thu, 03 Feb 2011 02:38:33 GMT
1) Generally, my take is that you don't want to integrate Mahout into your
application's class space ("embed" it).

For the same reason you probably don't want to embed, e.g., Pig. Pig is
known to be non-reentrant (i.e. you can kick off only one task at a time if
you embed it), and it also leaks resources/memory, so eventually you'd get an
OOM. I did not test Mahout for these problems, but I think at this point
command-line integration would be a prudent course.

After all, the command-line interface is much more developed than the API anyway.

If you did want to embed it, you'd need to create something like a Maven
assembly that includes Mahout along with all its transitive dependencies.
Depending on your comfort level with Maven, you may or may not want to do
that.
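Roughly, something like a maven-shade-plugin stanza in your pom (illustrative; wire it into your own build and pin a version):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <!-- build one "fat" jar with Mahout and its transitive deps inlined -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

`mvn package` then produces a single jar you can put on your application's classpath.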

Additionally, if you do embed, you may need to embed all the Hadoop jars as
well (unless your bootstrap script can figure out $HADOOP_HOME/lib/*.jar
and add them to your program's classpath during startup -- I think that's
what Mahout and Pig do). I think that's considered the 'mainstream' way of
writing Hadoop clients (i.e. instead of embedding the Hadoop jars into the
assembly, use $HADOOP_HOME/lib/*.jar).
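E.g., a bootstrap script along these lines (paths and the launcher class are made up for illustration):

```shell
# Sketch of a launcher that picks up the cluster's Hadoop jars at startup
# instead of bundling them into the assembly.
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
CP="myapp.jar"
for jar in "$HADOOP_HOME"/lib/*.jar; do
  # skip the unexpanded glob when lib/ is missing or empty
  [ -e "$jar" ] && CP="$CP:$jar"
done
# putting the conf dir on the classpath makes core-site.xml visible,
# which is how MR clients find the cluster coordinates
CP="$CP:$HADOOP_HOME/conf"
echo "$CP"
# exec java -cp "$CP" com.example.MyMahoutClient "$@"
```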

Bottom line: integrating Mahout through the command line seems more robust
and hassle-free to me. But that's me.
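From Java that just means shelling out to bin/mahout. A rough sketch (the kmeans flag names below follow the CLI's options, but double-check them against `bin/mahout kmeans --help` on your version; paths are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class MahoutCliLauncher {

  // Assemble the equivalent of:
  //   bin/mahout kmeans -i <input> -c <clusters> -o <output> -x 10 -cd 0.001
  public static List<String> buildKMeansCommand(String mahoutHome) {
    List<String> cmd = new ArrayList<String>();
    cmd.add(mahoutHome + "/bin/mahout");
    cmd.add("kmeans");
    cmd.add("-i");  cmd.add("testdata1/points");
    cmd.add("-c");  cmd.add("testdata1/clusters");
    cmd.add("-o");  cmd.add("output");
    cmd.add("-x");  cmd.add("10");
    cmd.add("-cd"); cmd.add("0.001");
    return cmd;
  }

  public static void main(String[] args) throws Exception {
    String home = System.getenv("MAHOUT_HOME");
    if (home == null) {
      home = "/opt/mahout"; // illustrative default
    }
    List<String> cmd = buildKMeansCommand(home);
    System.out.println(String.join(" ", cmd));
    // To actually run it, uncomment (and check the exit code for failure):
    // Process p = new ProcessBuilder(cmd).inheritIO().start();
    // int exit = p.waitFor();
  }
}
```

A non-zero exit code from the child process is your failure signal, which is easier to handle robustly than catching whatever an embedded driver throws.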

3) Use ToolRunner (assuming you are set on the embedding approach). Whether
you use the command line or embed, I suspect you'd need to install a Hadoop
client, or at least include core-site.xml in your classpath at startup, in
order for MR clients to pick up the cluster coordinates.
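The minimal ToolRunner shape looks like this (needs the Hadoop jars, and a core-site.xml, on the classpath; the class name and what run() kicks off are illustrative):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() carries the cluster coordinates loaded from core-site.xml,
    // plus any -D key=value options ToolRunner parsed off the command line.
    // ... e.g. hand getConf() to KMeansDriver.run(...) here ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner handles the generic Hadoop options before calling run()
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}
```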



On Wed, Feb 2, 2011 at 5:46 PM, sharath jagannath <
sharathjagannath@gmail.com> wrote:

> Anybody?
>
>
> On Wed, Feb 2, 2011 at 5:36 PM, sharath jagannath <
> sharathjagannath@gmail.com> wrote:
>
> > My bad, I am using File in the program and expecting it to write to the
> > Hadoop FS.
> > Given that, I still need to figure out how to hook on to Hadoop.
> >
> > Thanks,
> > Sharath
> >
> >
> > On Wed, Feb 2, 2011 at 5:10 PM, sharath jagannath <
> > sharathjagannath@gmail.com> wrote:
> >
> >>
> >> With some effort, I am now able to write a simple KMeans clustering job.
> >> But it is writing all the data to the local folder again, not to the
> >> Hadoop FS.
> >> Again, I wanted to use Mahout as a jar and build my application on top of
> >> it, which I could not.
> >> So for now I am adding classes to the examples folder within the
> >> mahout-distribution.
> >>
> >> Any help/suggestion is much appreciated. I am kinda stuck with this. It is
> >> so frustrating not being able to proceed.
> >> I need suggestions on:
> >> 1. Packaging Mahout as a jar / integrating it with the rest of the
> >> application.
> >> 2. How to create Vectors from my data; my training set will have a
> >> context\tList<tags|rank>.
> >> 3. How to run the driver on Hadoop. I am setting environment variables
> >> (in Eclipse) using: properties->run/debug settings->environment->new
> >>
> >> Code: (Came with MEAP)
> >>
> >> package org.apache.mahout.clustering.syntheticcontrol.kmeans;
> >>
> >> import java.io.File;
> >> import java.io.IOException;
> >> import java.util.ArrayList;
> >> import java.util.List;
> >>
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.fs.FileSystem;
> >> import org.apache.hadoop.fs.Path;
> >> import org.apache.hadoop.io.IntWritable;
> >> import org.apache.hadoop.io.LongWritable;
> >> import org.apache.hadoop.io.SequenceFile;
> >> import org.apache.hadoop.io.Text;
> >> import org.apache.mahout.clustering.WeightedVectorWritable;
> >> import org.apache.mahout.clustering.kmeans.Cluster;
> >> import org.apache.mahout.clustering.kmeans.KMeansClusterer;
> >> import org.apache.mahout.clustering.kmeans.KMeansDriver;
> >> import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
> >> import org.apache.mahout.math.RandomAccessSparseVector;
> >> import org.apache.mahout.math.Vector;
> >> import org.apache.mahout.math.VectorWritable;
> >>
> >> public class SimpleKMeansCluster {
> >>
> >>   public static final double[][] points = {
> >>       {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
> >>       {8, 8}, {9, 8}, {8, 9}, {9, 9}};
> >>
> >>   public static void writePointsToFile(List<Vector> points, String fileName,
> >>       FileSystem fs, Configuration conf) throws IOException {
> >>     Path path = new Path(fileName);
> >>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
> >>         LongWritable.class, VectorWritable.class);
> >>     long recNum = 0;
> >>     VectorWritable vec = new VectorWritable();
> >>     for (Vector point : points) {
> >>       vec.set(point);
> >>       writer.append(new LongWritable(recNum++), vec);
> >>     }
> >>     writer.close();
> >>   }
> >>
> >>   public static List<Vector> getPoints(double[][] raw) {
> >>     List<Vector> points = new ArrayList<Vector>();
> >>     for (int i = 0; i < raw.length; i++) {
> >>       double[] fr = raw[i];
> >>       Vector vec = new RandomAccessSparseVector(fr.length);
> >>       vec.assign(fr);
> >>       points.add(vec);
> >>     }
> >>     return points;
> >>   }
> >>
> >>   public static void main(String[] args) throws Exception {
> >>     int k = 2;
> >>     List<Vector> vectors = getPoints(points);
> >>     File testData = new File("testdata1");
> >>     if (!testData.exists()) {
> >>       testData.mkdir();
> >>     }
> >>     testData = new File("testdata1/points");
> >>     if (!testData.exists()) {
> >>       testData.mkdir();
> >>     }
> >>     Configuration conf = new Configuration();
> >>     FileSystem fs = FileSystem.get(conf);
> >>     writePointsToFile(vectors, "testdata1/points/file1", fs, conf);
> >>     Path path = new Path("testdata1/clusters/part-00000");
> >>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
> >>         Text.class, Cluster.class);
> >>     for (int i = 0; i < k; i++) {
> >>       Vector vec = vectors.get(i);
> >>       Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
> >>       writer.append(new Text(cluster.getIdentifier()), cluster);
> >>     }
> >>     writer.close();
> >>
> >>     KMeansDriver.run(conf, new Path("testdata1/points"),
> >>         new Path("testdata1/clusters"), new Path("output"),
> >>         new EuclideanDistanceMeasure(), 0.001, 10, true, false);
> >>
> >>     SequenceFile.Reader reader = new SequenceFile.Reader(fs,
> >>         new Path("output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"),
> >>         conf);
> >>     IntWritable key = new IntWritable();
> >>     WeightedVectorWritable value = new WeightedVectorWritable();
> >>     while (reader.next(key, value)) {
> >>       System.out.println(value.toString() + " belongs to cluster " + key.toString());
> >>     }
> >>     reader.close();
> >>   }
> >> }
> >>
> >>
> >> Thanks,
> >> Sharath
> >>
> >>
> >> On Wed, Feb 2, 2011 at 2:43 PM, sharath jagannath <
> >> sharathjagannath@gmail.com> wrote:
> >>
> >>> I did not pass any arguments. I used the default ones:
> >>>
> >>> log.info("Running with default arguments");
> >>>
> >>> Path output = new Path("output");
> >>>
> >>> HadoopUtil.overwriteOutput(output);
> >>>
> >>> new Job().run(new Configuration(), new Path("testdata"), output,
> >>>     new EuclideanDistanceMeasure(), 80, 55, 0.5, 10);
> >>>
> >>>
> >>>
> >>>
> >
>
>
> --
> Thanks,
> Sharath Jagannath
>
