mahout-user mailing list archives

From jamal sasha <jamalsha...@gmail.com>
Subject Re: Converting to sequence file in mahout
Date Fri, 23 May 2014 18:05:19 GMT
Hi,
  I tried one of the implementations. Here is a copy-paste for reference:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SequenceOutput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);

    // The input file is local, not in HDFS
    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);

    // Delete the previous output if it exists
    if (fs.exists(filePath))
      fs.delete(filePath, true);

    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        filePath, Text.class, VectorWritable.class);

    // Run through the input file line by line
    String line;
    while ((line = reader.readLine()) != null) {
      // The try/catch skips lines (such as a header) that do not parse as numbers
      try {
        // Split with the given separator
        String[] c = line.split(args[3]);
        if (c.length > 1) {
          // Column 0 is the label, so slot 0 of d is never assigned and stays 0.0
          double[] d = new double[c.length];
          for (int i = 1; i < c.length; i++)
            d[i] = Double.parseDouble(c[i]);
          // Put the features in a vector
          Vector vec = new DenseVector(c.length);
          vec.assign(d);
          VectorWritable writable = new VectorWritable();
          writable.set(vec);

          // Create a key of the form label/label
          String label = c[0] + "/" + c[0];

          // Write the pair to the sequence file
          writer.append(new Text(label), writable);
        }
      } catch (NumberFormatException e) {
        continue;
      }
    }
    writer.close();
    reader.close();
  }
}
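For what it's worth, the parsing step above can be exercised on its own with the plain JDK (no Hadoop on the classpath); the sample line below is made up. It also makes visible that slot 0 of the feature array is never assigned and stays 0.0:

```java
import java.util.Arrays;

public class ParseDemo {
    // Mirrors the loop above: column 0 is the label, so d[0] is never
    // assigned and remains 0.0 in the resulting feature array.
    static double[] parseFeatures(String line, String sep) {
        String[] c = line.split(sep);
        double[] d = new double[c.length];
        for (int i = 1; i < c.length; i++) {
            d[i] = Double.parseDouble(c[i]);
        }
        return d;
    }

    public static void main(String[] args) {
        double[] d = parseFeatures("labelA,0.5,0.25", ",");
        System.out.println(Arrays.toString(d));  // [0.0, 0.5, 0.25]
    }
}
```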


It generates the output, but then throws an error when I try to run the
RowSimilarity job:
14/05/23 11:01:02 INFO mapreduce.Job: Task Id : attempt_1400790649200_0044_m_000000_1, Status : FAILED
Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

14/05/23 11:01:02 INFO mapreduce.Job: Task Id : attempt_1400790649200_0044_m_000001_1, Status : FAILED
Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Any clues?


On Fri, May 23, 2014 at 1:55 AM, Suneel Marthi <smarthi@apache.org> wrote:

> The input needs to be converted to a sequence file of vectors in order to be
> processed by Mahout's pipeline. This has been asked a few times recently;
> search the mail archives for Kevin Moulart's recent posts on doing this.
>
> The converted vectors are then fed to RowIdJob, which outputs a matrix and a
> docIndex; then feed the matrix (which is a DRM) to RowSimilarityJob.
>
>
>
>
> On Fri, May 23, 2014 at 1:31 AM, jamal sasha <jamalshasha@gmail.com>
> wrote:
>
> > Hi,
> >    I have data where each row is a comma-separated vector...
> > And these are a bunch of text files... like
> > 0.123,01433,0.932
> > 0.129,0.932,0.123
> > And I want to run Mahout's rowIdSimilarity module on it... but I am
> > guessing the input requirement is different.
> > How do I convert these csv vectors into the format consumed by Mahout's
> > rowIdSimilarity module?
> > Thanks
> >
>
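The pipeline described in the quoted reply can be sketched with the Mahout CLI. The HDFS paths below are illustrative, and the exact flags may vary between Mahout versions, so check `mahout rowid --help` and `mahout rowsimilarity --help` before running:

```shell
# Illustrative paths; verify flags against your Mahout version.
# 1. RowIdJob replaces the Text keys with sequential IntWritable row ids;
#    its output directory contains the DRM ("matrix") plus a "docIndex"
#    mapping row ids back to the original Text keys.
mahout rowid -i /user/me/vectors -o /user/me/rowid

# 2. Feed the IntWritable-keyed matrix to RowSimilarityJob
#    (--numberOfColumns is the vector dimension, 3 in the example above).
mahout rowsimilarity -i /user/me/rowid/matrix -o /user/me/similarity \
  --numberOfColumns 3 --similarityClassname SIMILARITY_COSINE
```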
