[ https://issues.apache.org/jira/browse/MAHOUT-6?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12572720#action_12572720 ]
Ted Dunning commented on MAHOUT-6:

A hash map is a great first implementation for a sparse vector. Ultimately,
it will need to be replaced, but delaying that day is a good thing. Also, a
really efficient structure is a pain in the ass to get exactly right. The
hash map you have will work right off the bat.
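To illustrate why a hash map works "right off the bat": the whole structure is a few lines of code. This is only a hypothetical sketch (the class and method names are illustrative, not the ones in the attached patches):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a hash-map-backed sparse vector.
class HashSparseVector {
  private final Map<Integer, Double> values = new HashMap<>();
  private final int cardinality; // dimension; bounds checks omitted in this sketch

  HashSparseVector(int cardinality) {
    this.cardinality = cardinality;
  }

  void set(int index, double value) {
    if (value == 0.0) {
      values.remove(index);          // never store zeros; keeps the map sparse
    } else {
      values.put(index, value);
    }
  }

  double get(int index) {
    return values.getOrDefault(index, 0.0);
  }

  double dot(HashSparseVector other) {
    // iterate over the smaller map, probe the larger one
    HashSparseVector small = values.size() <= other.values.size() ? this : other;
    HashSparseVector big = (small == this) ? other : this;
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : small.values.entrySet()) {
      sum += e.getValue() * big.get(e.getKey());
    }
    return sum;
  }
}
```

The eventual replacement (parallel int/double arrays, open addressing, or similar) only changes the internals; the get/set/dot surface can stay the same.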
The primary use of SparseBinaryVector is as a row or column of a
SparseBinaryMatrix. A binary matrix is useful in cases where reduction to
binary values makes sense (many behavioral analysis cases are good for that,
as are many text analysis cases). It only makes sense, however, when there
is beginning to be serious memory pressure since its virtue is that you save
8 bytes per value. That can be 2/3 of the storage of some matrices. For
some of my key programs, I need fast row and column access to very large
binary matrices and getting 3x larger matrices to fit in memory (and buying
more memory) really helped.
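The arithmetic behind the 8-bytes-per-value saving: a typical sparse entry is a 4-byte int index plus an 8-byte double value, 12 bytes in all; a binary entry needs only the index, so dropping the value saves 8 of those 12 bytes, i.e. 2/3 of the storage. A binary sparse vector then reduces to a sorted index array. A hypothetical sketch (again, not the actual patch):

```java
import java.util.Arrays;

// Hypothetical sketch of a binary sparse vector: only the indices of the
// 1-valued entries are stored, sorted ascending, 4 bytes each.
class BinarySparseVector {
  private final int[] indices; // sorted indices of the 1 entries

  BinarySparseVector(int[] sortedIndices) {
    this.indices = sortedIndices.clone();
  }

  boolean get(int index) {
    return Arrays.binarySearch(indices, index) >= 0;
  }

  // dot product of two binary vectors = size of the index intersection,
  // computed by a linear merge of the two sorted arrays
  int dot(BinarySparseVector other) {
    int i = 0, j = 0, count = 0;
    while (i < indices.length && j < other.indices.length) {
      if (indices[i] == other.indices[j]) { count++; i++; j++; }
      else if (indices[i] < other.indices[j]) { i++; }
      else { j++; }
    }
    return count;
  }
}
```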
I used Matrix1D out of inertia from Colt. The only virtue to the notation
is that it makes sense to go eventually to Matrix3D and Matrix4D, but the
vector terminology is so well known that I wouldn't think it a problem.
Nobody is ever going to be confused. Some purists might object that a
vector is an object from linear algebra whereas what we have is a
single-indexed array with a few linear algebra operations tacked on. I am
not a purist.
> Need a matrix implementation
> 
>
> Key: MAHOUT-6
> URL: https://issues.apache.org/jira/browse/MAHOUT-6
> Project: Mahout
> Issue Type: New Feature
> Reporter: Ted Dunning
> Attachments: MAHOUT-6a.diff, MAHOUT-6b.diff, MAHOUT-6c.diff, MAHOUT-6d.diff, MAHOUT-6e.diff, MAHOUT-6f.diff
>
>
> We need matrices for Mahout.
> An initial set of basic requirements includes:
> a) sparse and dense support are required
> b) row and column labels are important
> c) serialization for hadoop use is required
> d) reasonable floating point performance is required, but awesome FP is not
> e) the API should be simple enough to understand
> f) it should be easy to carve out submatrices for sending to different reducers
> g) a reasonable set of matrix operations should be supported; these should eventually include:
> simple matrix-matrix, matrix-vector, and matrix-scalar linear algebra operations: A B, A + B, A v, A + x, v + x, u + v, dot(u, v)
> row and column sums
> generalized level 2 and 3 BLAS primitives, alpha A B + beta C and alpha A u + beta v
> h) easy and efficient iteration constructs, especially for sparse matrices
> i) easy to extend with new implementations
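The level 2/3 BLAS primitives in (g) need nothing exotic; for dense matrices, alpha A B + beta C is a triple loop. A minimal sketch on plain 2-D arrays (the class and method names here are illustrative, not a proposed API):

```java
// Hypothetical sketch of the level-3 BLAS primitive alpha A B + beta C
// from requirement (g), on plain dense arrays.
class DenseOps {
  static double[][] gemm(double alpha, double[][] a, double[][] b,
                         double beta, double[][] c) {
    int m = a.length;       // rows of A and C
    int k = a[0].length;    // columns of A == rows of B
    int n = b[0].length;    // columns of B and C
    double[][] result = new double[m][n];
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        double sum = 0.0;
        for (int p = 0; p < k; p++) {
          sum += a[i][p] * b[p][j];
        }
        result[i][j] = alpha * sum + beta * c[i][j];
      }
    }
    return result;
  }
}
```

The interesting design work is in the sparse case, where requirement (h)'s iteration constructs let the same primitive skip zero entries instead of looping over every (i, j, p).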
