mahout-dev mailing list archives

From "Edmond Luo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1833) One more svec function accepting cardinality as parameter
Date Tue, 19 Apr 2016 08:28:25 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247368#comment-15247368 ]

Edmond Luo edited comment on MAHOUT-1833 at 4/19/16 8:27 AM:
-------------------------------------------------------------

I have implemented the new wrapper function as shown above and added some test code for it.
However, I am not sure whether I should also add DRM test cases that use sparse vectors; it seems
we currently have no test cases for DRMs built from sparse vectors.


was (Author: resec):
I have implemented all above code and added some testing code for the new function, 
However, I am not sure if I should add some DRM test cases using sparse vector, seems now
we do not have any test case for those DRM built from sparse vector.

> One more svec function accepting cardinality as parameter 
> ----------------------------------------------------------
>
>                 Key: MAHOUT-1833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1833
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>         Environment: Mahout Spark Shell 0.12.0,
> Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1, 
> Centos7 64bit
>            Reporter: Edmond Luo
>
> It would be nice to add one more wrapper function, like the one below, to org.apache.mahout.math.scalabindings:
> {code}
> /**
>  * Creates a sparse vector from a list of Tuple2s with a specific cardinality (size).
>  * Throws IllegalArgumentException if cardinality is smaller than the cardinality
>  * required by sdata.
>  * @param cardinality the logical size of the resulting vector
>  * @param sdata the (index, value) pairs
>  * @return the sparse vector
>  */
> def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
>   val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
>   if (cardinality < required) {
>     throw new IllegalArgumentException(s"Cardinality[$cardinality] must not be smaller than required[$required]!")
>   }
>   val initialCapacity = sdata.size
>   val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
>   sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
>   sv
> }
> {code}
> So the user can specify the cardinality of the created sparse vector.
> This is very useful and convenient when the user wants to create a DRM from many sparse vectors
> whose actual sizes differ but whose logical size is the same (e.g. rows of a sparse matrix).
> The code below demonstrates the case:
> {code}
> var cardinality = 20
> val rdd = sc.textFile("/some/file.txt")
>   .map(_.split(","))
>   .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
>   .reduceByKey((v1, v2) => v1 ++ v2)
>   .map(row => (row._1, svec(cardinality, row._2)))
> val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))
> // All element-wise operations will fail for DRMs built from SparseVectors
> // without consistent cardinality
> val drm2 = drm + drm
> val drm3 = drm - drm
> val drm4 = drm * drm
> val drm5 = drm / drm
> {code}
> Notice that in the last map, svec accepts one more parameter, so the cardinality of the created
> SparseVectors is consistent.
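
The cardinality problem described in the issue can be reproduced without Mahout or Spark. The following is a minimal plain-Scala sketch, not Mahout's implementation: `SparseVec`, `svecInferred` and `svecSized` are hypothetical names introduced for illustration only. It assumes, as the description implies, that the existing one-argument svec derives the size from the largest index present. It also materialises the input once before inspecting it, because a single-pass collection such as an Iterator can only be traversed safely one time.

```scala
// Hypothetical stand-in for a sparse vector; NOT Mahout's RandomAccessSparseVector.
final case class SparseVec(cardinality: Int, entries: Map[Int, Double]) {
  // Element-wise addition is only defined for vectors of equal logical size.
  def +(that: SparseVec): SparseVec = {
    require(cardinality == that.cardinality,
      s"Cardinality mismatch: $cardinality vs ${that.cardinality}")
    val keys = entries.keySet ++ that.entries.keySet
    val summed = keys.map { k =>
      k -> (entries.getOrElse(k, 0.0) + that.entries.getOrElse(k, 0.0))
    }.toMap
    SparseVec(cardinality, summed)
  }
}

object SvecSketch {
  // Assumed behaviour of a size-inferring constructor: largest index + 1.
  def svecInferred(sdata: Iterator[(Int, Double)]): SparseVec = {
    val data = sdata.toVector // materialise once; an Iterator is single-pass
    val cardinality = if (data.nonEmpty) data.map(_._1).max + 1 else 0
    SparseVec(cardinality, data.toMap)
  }

  // Sketch of the proposed variant: the caller supplies the logical size.
  def svecSized(cardinality: Int, sdata: Iterator[(Int, Double)]): SparseVec = {
    val data = sdata.toVector // materialise once; an Iterator is single-pass
    val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
    if (cardinality < required) {
      throw new IllegalArgumentException(
        s"Cardinality [$cardinality] must not be smaller than required [$required]")
    }
    SparseVec(cardinality, data.toMap)
  }
}
```

In this sketch, two rows built with `svecSized(20, ...)` can be added regardless of which indices they contain, whereas rows built with `svecInferred` from different index sets end up with different sizes and the addition is rejected.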



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
