mahout-user mailing list archives

From Kris Jack <mrkrisj...@gmail.com>
Subject Re: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
Date Thu, 10 Jun 2010 17:58:59 GMT
Hi Jake,

When I run

$ mahout vectordump --seqFile part-00000 --dictionary dict.out --printKey

I get:

Input Path: part-00000
0    elts: {0:c}
1    elts: {1:d}
2    elts: {2:e}
Dumped 3 Vectors

Given that my original data was

id1: A A B C
id2: B D D
id3: A B B E

how am I to interpret this?  Is it printing out the characters that are
unique for a given doc id?  I was expecting to see something that would
allow me to see how similar documents were to one another.
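
To make concrete the kind of output I was expecting, here is a minimal sketch in plain Python (not Mahout's API). It assumes each document is represented by raw term counts and that similarity means cosine similarity between those count vectors; the names docs, counts, cosine and similarity are illustrative only.

```python
from collections import Counter
from math import sqrt

# The three example documents, tokenised on whitespace.
docs = {
    "id1": "A A B C".split(),
    "id2": "B D D".split(),
    "id3": "A B B E".split(),
}

def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Term counts per document (assumption: raw counts, no TF-IDF weighting).
counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# One score per unordered pair of distinct documents.
similarity = {
    (i, j): cosine(counts[i], counts[j])
    for i in counts
    for j in counts
    if i < j
}

for (i, j), score in sorted(similarity.items()):
    print(f"{i} {j} {score:.3f}")
```

Under those assumptions this prints 0.183 for id1/id2, 0.667 for id1/id3 and 0.365 for id2/id3, i.e. one score per pair of documents rather than one term per row.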

Thanks,
Kris



2010/6/10 Jake Mannix <jake.mannix@gmail.com>

> On Thu, Jun 10, 2010 at 10:28 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
> >
> > Thanks very much for the help.  I looked into the problem a little deeper
> > and found that the org.apache.mahout.utils.vectors.lucene.Driver was
> > writing out LongWritables instead of IntWritables, so I just changed the
> > code in there.  Should this code be using IntWritables or LongWritables?
> >
>
> The reason why the Lucene Driver uses long is that Solr encodes uid's as
> long.  Kinda backwards, that Mahout wants ints, and Solr wants longs, but
> that's the way it is.
>
> Maybe the lucene Driver could take a boolean flag on whether to encode
> the keys as long or int?  Anyone have opinions on this?
>
>
> > After writing to a sequence file and running your matrix transposition
> > and multiplication, I get an output called part-00000.  If I read it using
> > $ mahout seqdumper --seqFile part-00000 then it outputs:
> >
>
> I would use "mahout vectordump" instead of "mahout seqdumper" and
> you'll get nicer output.
>
>  -jake
>



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
