mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout: Matrix and Vector Needs (page edited)
Date Thu, 27 Mar 2008 00:01:00 GMT
Matrix and Vector Needs (MAHOUT) edited by Jeff Eastman
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/Matrix+and+Vector+Needs
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=75990&originalVersion=6&revisedVersion=7






Content:
---------------------------------------------------------------------

h1. Intro

Most ML algorithms need to represent multidimensional data concisely and to perform common
operations on that data easily. MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary
cardinality, along with a set of common operations on their instances. Both are provided in
sparse and dense implementations that are memory resident and suitable for manipulating
intermediate results within mapper, combiner and reducer implementations. They are not intended
for applications requiring vectors or matrices that exceed the memory available to a single JVM,
though such applications might be able to use them within a larger organizing framework.

h2. Background

See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser]

h2. Vectors

Mahout supports a Vector interface that defines the following operations over all implementation
classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize,
plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector stores its
elements in a double[], which is efficient in both storage and access. The class SparseVector
stores its elements in a HashMap<Integer, Double>, which is surprisingly fast and efficient.
For sparse vectors, the size() method returns the number of elements currently stored, whereas
the cardinality() method returns the number of dimensions of the vector. An additional VectorView
class represents a view of an underlying vector, as specified by the viewPart() method. See the
JavaDocs for more complete definitions.
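
The distinction between size() and cardinality() for a hash-backed sparse vector can be illustrated
with a minimal sketch. The class SparseVectorSketch below is hypothetical and is not Mahout's
SparseVector; in particular, dropping an entry when it is set to zero is an assumption of this
sketch, not documented Mahout behaviour.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; this is not Mahout's SparseVector. */
final class SparseVectorSketch {
    private final int cardinality;                                 // number of dimensions
    private final Map<Integer, Double> values = new HashMap<>();   // stored entries only

    SparseVectorSketch(int cardinality) {
        if (cardinality < 0) {
            throw new IllegalArgumentException("cardinality must be non-negative");
        }
        this.cardinality = cardinality;
    }

    /** Number of dimensions, fixed at construction. */
    int cardinality() {
        return cardinality;
    }

    /** Number of entries currently stored in the map. */
    int size() {
        return values.size();
    }

    double get(int index) {
        checkIndex(index);
        Double v = values.get(index);
        return v == null ? 0.0 : v;   // absent entries read as zero
    }

    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index);     // assumption of this sketch: zeros are not stored
        } else {
            values.put(index, value);
        }
    }

    /** Dot product that iterates only over the entries of the smaller operand. */
    double dot(SparseVectorSketch other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        SparseVectorSketch small = size() <= other.size() ? this : other;
        SparseVectorSketch large = small == this ? other : this;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.values.entrySet()) {
            sum += e.getValue() * large.get(e.getKey());
        }
        return sum;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

In this sketch, a vector created with cardinality 1000 that has had two non-zero entries set
reports cardinality() of 1000 and size() of 2, and the cost of dot() depends on the number of
stored entries rather than on the cardinality.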

h2. Matrices

Mahout also supports a Matrix interface that defines a similar set of operations over all
implementation classes: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells,
like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum. The class DenseMatrix
stores its elements in a double[][], which is efficient in both storage and access. The class
SparseRowMatrix stores a Vector[] in which each row of the matrix is a SparseVector, and the
symmetric class SparseColumnMatrix stores a Vector[] in which each column is a SparseVector.
These two classes can quickly produce a given row or a given column, respectively.
A fourth class, SparseMatrix, uses a HashMap<Integer, Vector> whose values are also SparseVectors.
For sparse matrices, the size() method returns an int[2] containing the actual row and column
sizes, whereas the cardinality() method returns an int[2] with the number of dimensions of
each. An additional MatrixView class represents a view of an underlying matrix, as specified
by the viewPart() method. See the JavaDocs for more complete definitions.
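
The access trade-off of a row-oriented sparse layout can be sketched as follows. The class
SparseRowMatrixSketch is hypothetical and uses plain maps instead of Mahout's Vector classes;
it only illustrates why such a layout returns a row directly while a column has to be gathered
from every row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; this is not Mahout's SparseRowMatrix. */
final class SparseRowMatrixSketch {
    private final int columns;
    // One sparse map per row: column index -> stored value.
    private final List<Map<Integer, Double>> rows = new ArrayList<>();

    SparseRowMatrixSketch(int rowCount, int columnCount) {
        this.columns = columnCount;
        for (int i = 0; i < rowCount; i++) {
            rows.add(new HashMap<>());
        }
    }

    void set(int row, int column, double value) {
        checkColumn(column);
        rows.get(row).put(column, value);   // List.get rejects an invalid row index
    }

    double get(int row, int column) {
        checkColumn(column);
        return rows.get(row).getOrDefault(column, 0.0);
    }

    /** Cheap: the row is already stored as a unit. */
    Map<Integer, Double> getRow(int row) {
        return rows.get(row);
    }

    /** Costly by comparison: every row must be probed for the requested column. */
    Map<Integer, Double> getColumn(int column) {
        checkColumn(column);
        Map<Integer, Double> result = new HashMap<>();
        for (int r = 0; r < rows.size(); r++) {
            Double v = rows.get(r).get(column);
            if (v != null) {
                result.put(r, v);
            }
        }
        return result;
    }

    private void checkColumn(int column) {
        if (column < 0 || column >= columns) {
            throw new IndexOutOfBoundsException("column " + column);
        }
    }
}
```

A column-oriented layout would simply swap the roles of the two accessors, which is the sense
in which the row and column classes are symmetric.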

The Matrix interface does not currently provide invert or determinant methods, though these
are desirable. Arguably, SparseRowMatrix and SparseColumnMatrix ought to use the
HashMap<Integer, Vector> representation, and SparseMatrix should instead use a
HashMap<Integer, HashMap<Integer, Double>>. Other forms of sparse matrices can also be
envisioned that support different storage and access characteristics.
Because the assignColumn and assignRow operations accept all forms of Vector as arguments,
it is possible to construct instances of sparse matrices containing dense rows or columns.
See the JavaDocs for more complete definitions.

For applications like PageRank/TextRank, iterative approaches to calculating eigenvectors would
also be useful. So would batching of row/column operations, perhaps by having assignRow
or assignColumn accept UnaryFunction and BinaryFunction arguments.
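
As one example of an iterative eigenvector calculation, the following is a minimal sketch of
power iteration on a dense double[][]. It is not part of Mahout; the class and method names are
hypothetical, and the sketch assumes a square matrix whose dominant eigenvalue is positive and
strictly larger in magnitude than the others.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of the Mahout API. */
final class PowerIterationSketch {

    /**
     * Approximates the dominant eigenvector of the square matrix a by repeatedly
     * multiplying a start vector by a and rescaling it to unit Euclidean length.
     */
    static double[] dominantEigenvector(double[][] a, int maxIterations, double tolerance) {
        int n = a.length;
        double[] x = new double[n];
        Arrays.fill(x, 1.0 / Math.sqrt(n));            // uniform unit-length start vector

        for (int iter = 0; iter < maxIterations; iter++) {
            // y = a * x
            double[] y = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    y[i] += a[i][j] * x[j];
                }
            }

            // Euclidean norm of y
            double norm = 0.0;
            for (double v : y) {
                norm += v * v;
            }
            norm = Math.sqrt(norm);
            if (norm == 0.0) {
                throw new ArithmeticException("iterate collapsed to the zero vector");
            }

            // Rescale and record the largest per-component change
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                y[i] /= norm;
                change = Math.max(change, Math.abs(y[i] - x[i]));
            }
            x = y;
            if (change < tolerance) {
                break;                                 // converged
            }
        }
        return x;
    }
}
```

Each step needs only a matrix-vector product and a normalization, so the same loop could in
principle be written against the times and normalize operations of the interfaces above.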


h2. Ideas

As Vector and Matrix implementations are currently memory-resident, instances larger than
available memory are not supported. An extended set of implementations that use HBase
(BigTable) in Hadoop to represent their instances would facilitate applications requiring
such large collections.
See [MAHOUT-6|https://issues.apache.org/jira/browse/MAHOUT-6]
See [Hama|http://wiki.apache.org/hadoop/Hama]


h2. References

Have a look at the older parallel computing libraries such as [ScaLAPACK|http://www.netlib.org/scalapack/],
among others.

---------------------------------------------------------------------

