mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: How to convert SequenceFile<LongWritable,VectorWritable> to SequenceFile<IntWritable,VectorWritable>?
Date Wed, 25 May 2011 16:38:19 GMT
Hmm... then it looks like we're spitting out Long ids from the lucene.vector
output.

Take a look at RowIdJob - it takes a SequenceFile<Text,VectorWritable> and
converts it to a pair of sequence files: a
SequenceFile<IntWritable,VectorWritable> and a SequenceFile<IntWritable,Text>
(the latter being the "dictionary" of which int ids correspond to which
original text ids).  This job could be modified trivially by replacing every
reference to Text with LongWritable.
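
To make the re-keying idea concrete, here is a minimal in-memory sketch of
what such a job does with its input: it assigns consecutive int row ids and
records which original id each one came from. This is not the RowIdJob source.
It uses plain Java collections instead of Hadoop SequenceFiles, double[]
stands in for VectorWritable, and the names RowIdSketch, reindex, matrix and
docIndex are hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative model of re-keying rows from long ids to int ids (assumed names). */
class RowIdSketch {

    /** Holds the two outputs: the int-keyed rows and the int -> original id dictionary. */
    static final class Result {
        final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        final Map<Integer, Long> docIndex = new LinkedHashMap<>();
    }

    /**
     * Walks the rows in iteration order and gives each one the next int id.
     * The original long id is kept in docIndex so it can be looked up later.
     */
    static Result reindex(Map<Long, double[]> rows) {
        Result result = new Result();
        int nextRow = 0;
        for (Map.Entry<Long, double[]> row : rows.entrySet()) {
            result.matrix.put(nextRow, row.getValue());
            result.docIndex.put(nextRow, row.getKey());
            nextRow++;
        }
        return result;
    }
}
```

Because the int ids are assigned by position rather than by casting, this
approach also works when an original long id does not fit into an int.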

On Wed, May 25, 2011 at 8:00 AM, Stefan Wienert <stefan@wienert.cc> wrote:

> Yes, with:
> bin/mahout lucene.vector \
>        --dir /home/hadoop/MahoutStatements/tf_index \
>        --field fulltext \
>        --dictOut /home/hadoop/MahoutStatements/dict.txt \
>        --output /home/hadoop/MahoutStatements/tfidf-vectors  \
>        --idField id \
>        --weight TFIDF
>
> 2011/5/25 Jake Mannix <jake.mannix@gmail.com>:
> > Did you rebuild your tfidf-vectors with trunk as well?
> >
> > On Wed, May 25, 2011 at 6:59 AM, Stefan Wienert <stefan@wienert.cc> wrote:
> >
> >> First, I use http://svn.apache.org/repos/asf/mahout/trunk, tested a few
> >> minutes ago with the newest version.
> >>
> >> And still:
> >> bin/mahout transpose \
> >> --input /home/hadoop/MahoutStatements/tfidf-vectors \
> >> --numRows 227 \
> >> --numCols 107909 \
> >> --tempDir /home/hadoop/MahoutStatements/tfidf-matrix/transpose
> >> produces:
> >> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> >> be cast to org.apache.hadoop.io.IntWritable
> >>
> >> My first idea, changing "lucene.vector", does not work; there is too
> >> much to change.
> >>
> >> So... Ideas? What about changing "transpose" and "matrixmult" to use
> >> LongWritable instead of IntWritable? Is this problematic?
> >>
> >> 2011/5/25 Jake Mannix <jake.mannix@gmail.com>:
> >> > On Wed, May 25, 2011 at 6:14 AM, Stefan Wienert <stefan@wienert.cc> wrote:
> >> >
> >> >> So the real problem is that "transpose" and "matrixmult" (maybe)
> >> >> still use IntWritable instead of LongWritable.
> >> >>
> >> >
> >> > It's the other way around: matrix operations use keys which are ints,
> >> > and the lucene.vector class needs to respect this.  It doesn't on
> >> > current trunk?
> >> >
> >> >  -jake
> >> >
> >>
> >>
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> stefan@wienert.cc
> >>
> >> Telefon: +495251-2026838 (neue Nummer seit 20.06.10)
> >> Mobil: +49176-40170270
> >>
> >
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838 (neue Nummer seit 20.06.10)
> Mobil: +49176-40170270
>
