mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>
Date Thu, 10 Feb 2011 02:09:57 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992834#comment-12992834 ]

Ted Dunning commented on MAHOUT-322:
------------------------------------

Take a look at classifier.sgd.PolymorphicWritable<T extends Writable> for one way to have
generic values in a Writable.  It is inefficient for lots of small values (like keys), though.

T needs to be a superclass of whatever data is in the file.  The class name is written into
the file along with the data, and that class needs a simple (presumably no-argument)
constructor so that it can be instantiated when the file is read.

I use this extensively for serializing and deserializing SGD models, which have all kinds of
polymorphism.  For example, there are multiple kinds of Gradient, OnlineAuc, VectorClassifier,
and many others.
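
For illustration, the mechanism described above (class name written ahead of the data,
then reflective construction on read) can be sketched roughly as follows.  This is not the
source of Mahout's PolymorphicWritable.  SimpleWritable is a hypothetical stand-in for
Hadoop's Writable so that the sketch runs without Hadoop on the classpath, and Shape,
Circle, Square and PolymorphicIO are made-up names used only for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for org.apache.hadoop.io.Writable (assumption, not Hadoop's type).
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Common supertype: plays the role of T in PolymorphicWritable<T extends Writable>.
abstract class Shape implements SimpleWritable {
    abstract double area();
}

class Circle extends Shape {
    double radius;
    Circle() {}                        // no-argument constructor used by reflection
    Circle(double radius) { this.radius = radius; }
    double area() { return Math.PI * radius * radius; }
    public void write(DataOutput out) throws IOException { out.writeDouble(radius); }
    public void readFields(DataInput in) throws IOException { radius = in.readDouble(); }
}

class Square extends Shape {
    double side;
    Square() {}                        // no-argument constructor used by reflection
    Square(double side) { this.side = side; }
    double area() { return side * side; }
    public void write(DataOutput out) throws IOException { out.writeDouble(side); }
    public void readFields(DataInput in) throws IOException { side = in.readDouble(); }
}

final class PolymorphicIO {
    // Writes the concrete class name first, then the object's own fields.
    static <T extends SimpleWritable> void write(DataOutput out, T value) throws IOException {
        out.writeUTF(value.getClass().getName());
        value.write(out);
    }

    // Reads the class name, checks it is a subtype of base, instantiates it, fills it in.
    static <T extends SimpleWritable> T read(DataInput in, Class<T> base) throws IOException {
        String className = in.readUTF();
        try {
            T value = Class.forName(className).asSubclass(base)
                    .getDeclaredConstructor().newInstance();
            value.readFields(in);
            return value;
        } catch (ReflectiveOperationException e) {
            throw new IOException("Cannot instantiate " + className, e);
        }
    }
}

class PolymorphicDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        PolymorphicIO.write(out, new Circle(2.0));
        PolymorphicIO.write(out, new Square(3.0));

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        // The caller only names the supertype; the concrete class comes from the stream.
        Shape first = PolymorphicIO.read(in, Shape.class);
        Shape second = PolymorphicIO.read(in, Shape.class);
        System.out.println(first.getClass().getSimpleName() + " area=" + first.area());
        System.out.println(second.getClass().getSimpleName() + " area=" + second.area());
    }
}
```

Every record carries its full class name, which is the overhead that makes this kind of
scheme a poor fit for lots of small values such as keys.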

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>            Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix states that
> the matrix lives in SequenceFile<WritableComparable, VectorWritable>. Implementation,
> however, assumes SequenceFile<IntWritable, VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD package,
> mainly to perform PCA on a massive document corpus. Given such corpus, it makes sense to not
> limit the user by forcing the document "key" to be integer. Instead, users should be able
> to use Text keys (document name or id) or keys made of any other arbitrary class. One may
> even argue that forcing a WritableComparable key is too limiting, and a simple Writable key
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the SequenceFile key at
> all, to allow user-specific classes (unknown to Mahout) to be used as opaque keys even when
> their libraries are not available in runtime. Currently DistributedRowMatrix calls
> "reader.next(i, v)"... but reader has methods to query just the value, avoiding key
> deserialization altogether.
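
The idea in the issue description of reading only the value and treating the key as
opaque bytes can be illustrated with a small stand-alone sketch.  The record layout here
(length-prefixed key bytes followed by a row of doubles) is an assumption made up for the
example.  It is not the SequenceFile on-disk format and does not use the Hadoop reader API,
and OpaqueKeyRecords and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class OpaqueKeyRecords {
    // Hypothetical layout: [int keyLength][key bytes][int rowLength][rowLength doubles].
    static void writeRecord(DataOutput out, byte[] opaqueKey, double[] row) throws IOException {
        out.writeInt(opaqueKey.length);
        out.write(opaqueKey);
        out.writeInt(row.length);
        for (double v : row) {
            out.writeDouble(v);
        }
    }

    // Skips the key bytes without interpreting them, then deserializes only the value.
    static double[] readValueSkippingKey(DataInput in) throws IOException {
        int keyLength = in.readInt();
        skipFully(in, keyLength);
        int rowLength = in.readInt();
        double[] row = new double[rowLength];
        for (int i = 0; i < rowLength; i++) {
            row[i] = in.readDouble();
        }
        return row;
    }

    // skipBytes may skip fewer bytes than requested, so loop until done.
    private static void skipFully(DataInput in, int n) throws IOException {
        int skipped = 0;
        while (skipped < n) {
            int s = in.skipBytes(n - skipped);
            if (s <= 0) {
                throw new EOFException("Stream ended inside a key");
            }
            skipped += s;
        }
    }
}

class OpaqueKeyDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        // The key could come from any user class; the reader never needs that class.
        OpaqueKeyRecords.writeRecord(out, "doc-a".getBytes(StandardCharsets.UTF_8),
                new double[] {1.0, 2.0});
        OpaqueKeyRecords.writeRecord(out, "document-b".getBytes(StandardCharsets.UTF_8),
                new double[] {3.0});

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        double[] firstRow = OpaqueKeyRecords.readValueSkippingKey(in);
        double[] secondRow = OpaqueKeyRecords.readValueSkippingKey(in);
        System.out.println(firstRow.length + " values, then " + secondRow.length + " value");
    }
}
```

Because the key is never turned back into an object, the reading side does not need the
key's class on its classpath, which is the property the issue asks for.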

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
