mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <>
Subject [jira] Commented: (MAHOUT-6) Need a matrix implementation
Date Tue, 26 Feb 2008 23:48:51 GMT


Ted Dunning commented on MAHOUT-6:

A hash map is a great first implementation for a sparse vector.  Ultimately,
it will need to be replaced, but delaying that day is a good thing.  Also, a
really efficient structure is a pain in the ass to get exactly right.  The
hash map you have will work right off the bat.
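
To make the hash map idea concrete, here is a minimal sketch of a sparse vector that stores only non-zero entries in a java.util.HashMap.  The class name HashSparseVector and its methods are hypothetical and chosen for illustration; this is not the code from the MAHOUT-6 patches.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sparse vector backed by a hash map (illustration only). */
final class HashSparseVector {
    private final int cardinality;                               // logical length of the vector
    private final Map<Integer, Double> values = new HashMap<>(); // index -> non-zero value

    HashSparseVector(int cardinality) {
        if (cardinality < 0) {
            throw new IllegalArgumentException("cardinality must be non-negative");
        }
        this.cardinality = cardinality;
    }

    /** Returns the stored value, or 0.0 for any index that was never set. */
    double get(int index) {
        checkIndex(index);
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    /** Stores a value; writing zero removes the entry so the map stays sparse. */
    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }

    int nonZeroCount() {
        return values.size();
    }

    /** Dot product that iterates over the vector with fewer stored entries. */
    double dot(HashSparseVector other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        HashSparseVector small = values.size() <= other.values.size() ? this : other;
        HashSparseVector large = small == this ? other : this;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.values.entrySet()) {
            sum += e.getValue() * large.get(e.getKey());
        }
        return sum;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

The boxing of Integer keys and Double values is a large part of why a structure like this would eventually want replacing, but it is short and hard to get wrong.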

The primary use of SparseBinaryVector is as a row or column of a
SparseBinaryMatrix.  A binary matrix is useful in cases where reduction to
binary values makes sense (many behavioral analysis cases are good for that,
as are many text analysis cases).  It only makes sense, however, once memory
pressure becomes serious, since its virtue is that you save 8 bytes per
value.  That can be 2/3 of the storage of some matrices.  For some of my key
programs, I need fast row and column access to very large binary matrices,
and getting 3x larger matrices to fit in memory (and buying more memory)
really helped.
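
The 2/3 figure can be checked with simple arithmetic, assuming each stored entry is a 4-byte int index plus an 8-byte double value: dropping the value leaves 4 of 12 bytes, so the same memory holds 3x as many entries.  The sketch below shows one way a binary vector could store indices only.  The class SortedIndexBinaryVector, its sorted-array layout, and the helper methods are assumptions for illustration, not the SparseBinaryVector from the patches.

```java
import java.util.Arrays;

/** Hypothetical binary vector that stores only the positions holding a 1. */
final class SortedIndexBinaryVector {
    private final int[] indices; // sorted, distinct positions whose value is 1

    SortedIndexBinaryVector(int... setIndices) {
        this.indices = Arrays.stream(setIndices).distinct().sorted().toArray();
    }

    /** True if the given position holds a 1 (binary search over the sorted indices). */
    boolean get(int index) {
        return Arrays.binarySearch(indices, index) >= 0;
    }

    int nonZeroCount() {
        return indices.length;
    }

    /** Dot product of two binary vectors: the number of shared positions. */
    int dot(SortedIndexBinaryVector other) {
        int i = 0, j = 0, count = 0;
        while (i < indices.length && j < other.indices.length) {
            int a = indices[i];
            int b = other.indices[j];
            if (a == b) {
                count++;
                i++;
                j++;
            } else if (a < b) {
                i++;
            } else {
                j++;
            }
        }
        return count;
    }

    /** Payload bytes when every entry keeps an int index and a double value. */
    static long bytesWithValues(long nonZeros) {
        return nonZeros * (Integer.BYTES + Double.BYTES);
    }

    /** Payload bytes when every entry keeps only an int index. */
    static long bytesIndexOnly(long nonZeros) {
        return nonZeros * Integer.BYTES;
    }
}
```

These byte counts cover the raw payload only; object headers and any hashing overhead are ignored.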

I used Matrix1D out of inertia from Colt.  The only virtue to the notation
is that it makes sense to go eventually to Matrix3D and Matrix4D, but the
vector terminology is so well known that I wouldn't think it a problem.
Nobody is ever going to be confused.  Some purists might object that a
vector is an object from linear algebra whereas what we have is a
single-indexed array with a few linear algebra operations tacked on.  I am
not a purist.

> Need a matrix implementation
> ----------------------------
>                 Key: MAHOUT-6
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-6a.diff, MAHOUT-6b.diff, MAHOUT-6c.diff, MAHOUT-6d.diff, MAHOUT-6e.diff, MAHOUT-6f.diff
> We need matrices for Mahout.
> An initial set of basic requirements includes:
> a) sparse and dense support are required
> b) row and column labels are important
> c) serialization for hadoop use is required
> d) reasonable floating point performance is required, but awesome FP is not
> e) the API should be simple enough to understand
> f) it should be easy to carve out sub-matrices for sending to different reducers
> g) a reasonable set of matrix operations should be supported, these should eventually include:
>     simple matrix-matrix and matrix-vector and matrix-scalar linear algebra operations,
>       A B, A + B, A v, A + x, v + x, u + v, dot(u, v)
>     row and column sums
>     generalized level 2 and 3 BLAS primitives, alpha A B + beta C and A u + beta v
> h) easy and efficient iteration constructs, especially for sparse matrices
> i) easy to extend with new implementations
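
As an illustration of the generalized level 2 primitive named in item g), the sketch below computes alpha A u + beta v over plain dense arrays.  The class DenseOps and the method gemv are hypothetical names, the alpha factor follows the usual BLAS convention rather than anything stated in the issue, and nothing here reflects the API proposed in the patches.

```java
/** Hypothetical dense helper for the level 2 primitive alpha * A u + beta * v. */
final class DenseOps {
    private DenseOps() {}

    /** Returns alpha * A u + beta * v as a new array; the inputs are left unchanged. */
    static double[] gemv(double alpha, double[][] a, double[] u, double beta, double[] v) {
        if (a.length != v.length) {
            throw new IllegalArgumentException("rows of A must match length of v");
        }
        double[] result = new double[v.length];
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != u.length) {
                throw new IllegalArgumentException("columns of A must match length of u");
            }
            double sum = 0.0;
            for (int j = 0; j < u.length; j++) {
                sum += a[i][j] * u[j]; // row i of A dotted with u
            }
            result[i] = alpha * sum + beta * v[i];
        }
        return result;
    }
}
```

Setting alpha to 1 gives the A u + beta v form written in the issue.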

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
