mahout-user mailing list archives

From Andy Schlaikjer <andrew.schlaik...@gmail.com>
Subject Re: Elephant-Bird SequenceFile Storage of RandomAccessSparseVectors for Mahout
Date Fri, 01 Mar 2013 17:59:25 GMT
Hi Colum,

I'm an ElephantBird project committer and wrote both
SequenceFileStorage and the VectorWritableConverter.

The default Writable type used by SequenceFileStorage for both key and
value is Text, which is why you get Text data when you don't specify
converters.
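
One quick way to check which types a job actually wrote is to read the
SequenceFile header, which records the key and value class names. Below is a
minimal sketch in Python (standard library only). It assumes the header
layout used by SequenceFile version 4 and later: the bytes "SEQ", one version
byte, then the two class names as vint-length-prefixed UTF-8 strings. The
function names are illustrative and are not part of Hadoop or ElephantBird.

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer (WritableUtils layout)."""
    first = struct.unpack("b", stream.read(1))[0]  # signed first byte
    if first >= -112:
        return first  # small values are stored in a single byte
    negative = first < -120
    extra = (-120 - first) if negative else (-112 - first)
    value = int.from_bytes(stream.read(extra), "big")
    return ~value if negative else value


def read_text_string(stream):
    """Read a string written as a vint byte length followed by UTF-8 bytes."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def sequence_file_classes(stream):
    """Return (version, key class name, value class name) from the header."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = stream.read(1)[0]
    key_class = read_text_string(stream)
    value_class = read_text_string(stream)
    return version, key_class, value_class


if __name__ == "__main__":
    # Hand-built header for demonstration only; with real data, open a
    # part file copied out of HDFS in binary mode instead.
    def _text(s):
        data = s.encode("utf-8")
        return bytes([len(data)]) + data  # one-byte vint, lengths up to 127

    header = (b"SEQ\x06"
              + _text("org.apache.hadoop.io.Text")
              + _text("org.apache.mahout.math.VectorWritable"))
    print(sequence_file_classes(io.BytesIO(header)))
```

If the value class comes back as org.apache.hadoop.io.Text rather than
org.apache.mahout.math.VectorWritable, the job ran with the default
converters.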

Could you provide some sample data or task attempt logs from your job to
help diagnose the issue? Unit tests for both of these utils cover a lot of
edge cases, but if you've found a new one I'd like to get it sorted out!

Thanks,
Andy




On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I haven't touched elephant bird in some time.  I had some fits with it at
> the time that I used it whenever I strayed from the well-trod path, but I
> had heard it was much better lately.
>
> Sorry not to be much more help than that.
>
> On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <columfoley@gmail.com> wrote:
>
> > I am trying to store Mahout RandomAccessSparseVectors using
> > elephant-bird and Pig. The data has the form
> > key (text), value (RandomAccessSparseVector). When I run Pig's describe
> > on it, I get the following:
> >
> > pair: {key: int,val: (cardinality: int,entries: {entry: (index: int,value: double)})}
> >
> > My problem is that when I try to store tuples using elephant-bird's
> > SequenceFileStorage as follows:
> >
> > store clusteredOut into 'logsvectors.dat' using
> > com.twitter.elephantbird.pig.store.SequenceFileStorage (
> >    '-c com.twitter.elephantbird.pig.util.TextConverter',
> >    '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter -- -sparse'
> > );
> >
> > It runs successfully, but when I examine the resulting SequenceFile all
> > the vectors are empty.
> >
> > On the other hand, if I run the following instead:
> >
> > store clusteredOut into 'logsvectors.dat' using
> > com.twitter.elephantbird.pig.store.SequenceFileStorage ();
> >
> > i.e., without specifying the types of the key or value, the vectors are
> > non-empty but are of type Text, and this causes my clustering algorithm
> > to fail (it expects VectorWritable).
> >
> > So my problem is that I need to output VectorWritable values, but when I
> > do, the resulting vectors are empty.
> >
> > Anyone else have experience with this issue?
> >
> > Many thanks,
> > Colum
> >
>
