From: Kris Jack <mrkrisjack@gmail.com>
Date: Thu, 10 Jun 2010 18:28:10 +0100
To: user@mahout.apache.org
Subject: Re: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable

Hi Jake,

Thanks very much for the help.  I looked into the problem a little deeper
and found that org.apache.mahout.utils.vectors.lucene.Driver was writing
out LongWritable keys rather than IntWritable keys, so I just changed the
code in there.  Should this code be writing IntWritables or LongWritables?

I managed to get the similarity matrix written to disk but I'm not at all
sure about the results.  My original input was 3 solr documents:

id1: A A B C
id2: B D D
id3: A B B E

After writing these to a sequence file and running your matrix
transposition and multiplication, I get an output called part-00000.  If I
read it using

$ mahout seqdumper --seqFile part-00000

then it outputs:

Input Path: part-00000
Key class: class org.apache.hadoop.io.IntWritable
Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: org.apache.mahout.math.VectorWritable@288051
Key: 1: Value: org.apache.mahout.math.VectorWritable@288051
Key: 2: Value: org.apache.mahout.math.VectorWritable@288051
Count: 3

Is this what is to be expected?

Thanks,
Kris
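Those identical "VectorWritable@288051" strings are just the default
Object.toString() on the single VectorWritable instance that the dumper
presumably reuses for every record, so they say nothing about the vector
contents either way. A minimal sketch for printing the entries themselves,
assuming part-00000 is a SequenceFile of IntWritable/VectorWritable pairs;
the DumpVectors class name and the command-line path argument are
placeholders, not anything from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public class DumpVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);   // e.g. part-00000
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();
    VectorWritable value = new VectorWritable();
    while (reader.next(key, value)) {
      // The wrapped Vector has a readable string form, unlike
      // VectorWritable itself.
      System.out.println(key.get() + "\t" + value.get().asFormatString());
    }
    reader.close();
  }
}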
2010/6/10 Jake Mannix

> Yeah, you simply can't cast between IntWritable and LongWritable, sadly.
> You need to convert your Long document ids to Integers.  Since you're
> pulling documents from Solr, the docIds should be sequential and start
> small, in which case they're all well under Integer.MAX_VALUE, and so a
> trivial MapReduce (well, Map, no Reduce) job with a Mapper like this
> should work:
>
> public class M
>     extends Mapper<LongWritable, Writable, IntWritable, Writable> {
>
>   private final IntWritable i = new IntWritable(0);
>
>   @Override
>   public void map(LongWritable key, Writable value, Context c)
>       throws IOException, InterruptedException {
>     i.set((int) key.get());
>     c.write(i, value);
>   }
> }
>
> Run that over your input first, and you should be set.
>
>   -jake
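To run a mapper like that as a map-only job, a driver along the following
lines should do it. This is only a sketch against the Hadoop 0.20
mapreduce API; the ConvertKeysJob class name and the argument paths are
placeholders, and it assumes the values carried through are Mahout
VectorWritables, as they are in this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public class ConvertKeysJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "LongWritable to IntWritable key conversion");
    job.setJarByClass(ConvertKeysJob.class);
    job.setMapperClass(M.class);   // the mapper from Jake's mail above
    job.setNumReduceTasks(0);      // map-only: no reduce phase needed
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // original vectors
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // converted copy
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}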
> On Thu, Jun 10, 2010 at 7:20 AM, Kris Jack wrote:
>
> > Got a little further by making some more class changes...
> >
> > //
> > public class GenSimMatrixJob extends AbstractJob {
> >
> >   public GenSimMatrixJob() {
> >   }
> >
> >   @Override
> >   public int run(String[] strings) throws Exception {
> >     addOption("numDocs", "nd", "Number of documents in the input");
> >     addOption("numTerms", "nt", "Number of terms in the input");
> >
> >     Map<String, String> parsedArgs = parseArguments(strings);
> >     if (parsedArgs == null) {
> >       // FIXME
> >       return 0;
> >     }
> >
> >     Configuration originalConf = getConf();
> >     String inputPathString = originalConf.get("mapred.input.dir");
> >     String outputTmpPathString = parsedArgs.get("--tempDir");
> >     int numDocs = Integer.parseInt(parsedArgs.get("--numDocs"));
> >     int numTerms = Integer.parseInt(parsedArgs.get("--numTerms"));
> >
> >     DistributedRowMatrix text = new DistributedRowMatrix(inputPathString,
> >         outputTmpPathString, numDocs, numTerms);
> >     text.configure(new JobConf(getConf()));
> >
> >     DistributedRowMatrix transpose = text.transpose();
> >     DistributedRowMatrix similarity = transpose.times(transpose);
> >
> >     System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> >
> >     return 1;
> >   }
> >
> >   public static void main(String[] args) throws Exception {
> >     ToolRunner.run(new GenSimMatrixJob(), args);
> >   }
> > }
> > //
> >
> > Giving the error...
> >
> > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > 10-Jun-2010 15:16:28 org.apache.hadoop.metrics.jvm.JvmMetrics init
> > INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.FileInputFormat listStatus
> > INFO: Total input paths to process : 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: Running job: job_local_0001
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.FileInputFormat listStatus
> > INFO: Total input paths to process : 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.util.NativeCodeLoader
> > WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 10-Jun-2010 15:16:28 org.apache.hadoop.io.compress.CodecPool getDecompressor
> > INFO: Got brand-new decompressor
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.MapTask runOldMapper
> > INFO: numReduceTasks: 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: io.sort.mb = 100
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: data buffer = 79691776/99614720
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: record buffer = 262144/327680
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.LocalJobRunner$Job run
> > WARNING: job_local_0001
> > java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
> >     at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: map 0% reduce 0%
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: Job complete: job_local_0001
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.Counters log
> > INFO: Counters: 0
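For what it's worth, the shape of that computation is why a
document-document matrix comes out at the end. As far as I can tell from
its row-wise outer-product implementation, DistributedRowMatrix.times(B)
computes A^T * B for A = this, so with A the numDocs x numTerms
document/term matrix:

    text.transpose()            = A^T                        (numTerms x numDocs)
    transpose.times(transpose)  = (A^T)^T * A^T  =  A * A^T  (numDocs x numDocs)

and A * A^T holds the dot products between pairs of document vectors, one
row of similarities per document. With 3 documents, a 3-row output keyed
0, 1, 2, as in the seqdumper listing above, is exactly what should appear.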
> >
> > 2010/6/10 Kris Jack
> >
> > > In the attempt to create a document-document similarity matrix, I am
> > > getting the following error:
> > >
> > > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.metrics.jvm.JvmMetrics init
> > > INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > > WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > > WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.FileInputFormat listStatus
> > > INFO: Total input paths to process : 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: Running job: job_local_0001
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.FileInputFormat listStatus
> > > INFO: Total input paths to process : 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.util.NativeCodeLoader
> > > WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.io.compress.CodecPool getDecompressor
> > > INFO: Got brand-new decompressor
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask runOldMapper
> > > INFO: numReduceTasks: 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: io.sort.mb = 100
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: data buffer = 79691776/99614720
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: record buffer = 262144/327680
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.LocalJobRunner$Job run
> > > WARNING: job_local_0001
> > > java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
> > >     at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> > >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: map 0% reduce 0%
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: Job complete: job_local_0001
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.Counters log
> > > INFO: Counters: 0
> > > Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Job failed!
> > >     at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:163)
> > >     at org.apache.mahout.math.hadoop.GenSimMatrixLocal.generateMatrix(GenSimMatrixLocal.java:24)
> > >     at org.apache.mahout.math.hadoop.GenSimMatrixLocal.main(GenSimMatrixLocal.java:34)
> > > Caused by: java.io.IOException: Job failed!
> > >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> > >     at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:158)
> > >     ... 2 more
> > >
> > > I created a test solr index with 3 documents and generated a sparse
> > > feature matrix out of it using mahout's
> > > org.apache.mahout.utils.vectors.lucene.Driver.
> > >
> > > I then ran the following code using the sparse feature matrix as input
> > > (mahoutIndexTFIDF.vec).
> > >
> > > public class GenSimMatrixLocal {
> > >
> > >   private void generateMatrix() {
> > >     String inputPath = "/home/kris/data/mahoutIndexTFIDF.vec";
> > >     String tmpPath = "/tmp/matrixMultiplySpace";
> > >     int numDocuments = 3;
> > >     int numTerms = 4;
> > >
> > >     DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
> > >         tmpPath, numDocuments, numTerms);
> > >
> > >     JobConf conf = new JobConf("similarity job");
> > >     text.configure(conf);
> > >
> > >     DistributedRowMatrix transpose = text.transpose();
> > >     DistributedRowMatrix similarity = transpose.times(transpose);
> > >
> > >     System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> > >   }
> > >
> > >   public static void main(String[] args) {
> > >     GenSimMatrixLocal similarity = new GenSimMatrixLocal();
> > >     similarity.generateMatrix();
> > >   }
> > > }
> > >
> > > Anyone see why there is a problem between LongWritable and IntWritable
> > > casting?
> > > Does it need to be configured differently?
> > >
> > > Thanks,
> > > Kris
> >
> > --
> > Dr Kris Jack,
> > http://www.mendeley.com/profiles/kris-jack/

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/