mahout-user mailing list archives

From Dmitriy Lyubimov <>
Subject Re: populating and serializing large sparse matrices
Date Thu, 11 Jun 2015 22:53:01 GMT
I guess you are talking about the DRM format (sequence file).

The current recommended way is to use mahout-samsara with e.g. Spark (there
is no mapreduce support there). Translating an in-core matrix (sparse, for
example) means converting it to a distributed matrix (DRM) first by
means of drmParallelize [1] and then saving it to hdfs by means of dfsWrite
[2] (the doc's method name for saving a matrix is a bit outdated there).

It does imply a Spark cluster (although you can always run it in local mode,
which is just as good as a completely in-core save).
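To make the drmParallelize/dfsWrite flow above concrete, here is a minimal sketch. It assumes a Mahout Samsara + Spark bindings setup on the classpath; the matrix contents, partition count, and output path are placeholders, not anything from this thread.

```scala
// Sketch only: assumes mahout-math-scala and mahout-spark bindings are
// available. Matrix values and the output path are illustrative.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._

// A Spark-backed distributed context; "local[*]" runs it in local mode,
// which (as noted above) is effectively an in-core save.
implicit val ctx = mahoutSparkContext(masterUrl = "local[*]",
                                      appName = "drm-save-example")

// A small in-core sparse matrix, populated via the scalabindings DSL.
val inCore = sparse(
  (0, 1.0) :: (5, 2.0) :: Nil,
  (2, 3.0) :: Nil
)

// Distribute it as a DRM, then write it out as a DRM sequence file
// (int row keys, VectorWritable values) on the configured (H)DFS.
val drm = drmParallelize(inCore, numPartitions = 2)
drm.dfsWrite("hdfs:///tmp/my-matrix.drm")
```

Since the on-disk format stores each row as a (mostly) sparse vector, a matrix like the ~3M-row, ~20-nonzeros-per-row one described below should take space proportional to the nonzeros rather than the full dimensions.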


On Thu, Jun 11, 2015 at 1:53 PM, Patrice Seyed <> wrote:

> Hi,
> I'm looking for a good solution to populate and serialize a large
> sparse matrix using Mahout or related libraries. I noticed
> SparseMatrix is not serializable when I considered serializing this
> java object to file.  In an experiment to serialize out to a sequence
> file, my ~3mil row matrix (avg ~20 col, sparse), after about 500k row
> the sequence file
> was taking about 115 GB space. Lucene is another idea but has similar
> demand on disk space.
> Are there more efficient ways of serializing a matrix to disk? Is
> there something akin to python's ndarray? (Which I have noticed
> handles population/serialization of quite large sparse matrices well.)
> The object DistributedRowMatrix was mentioned to me, but: 1) does it
> suit my use case? The constructor takes a sequence file as an argument
> (the generation of which I am having the issue with), and 2) there is
> no method for accessing a row at an index, which I would need.
> Thanks in advance for any suggestions,
> Best,
> Patrice
