mahout-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis
Date Mon, 21 Jul 2014 21:42:39 GMT


ASF GitHub Bot commented on MAHOUT-1541:

Github user dlyubimov commented on a diff in the pull request:
    --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
    @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
       private var cached: Boolean = false
       override val context: DistributedContext = rdd.context
    +  /**
    +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes
    +   * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
    +   * No physical changes are made to the underlying rdd, no blank rows are added as would be done with rbind(blankRows)
    +   * @param n number to increase row cardinality by
    +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
    +   *       results.
    +   */
    +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
    +    assert(n > -1)
    +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel )
    +  }
    --- End diff --
    -1 on this PR.
    I am not sure what problem it is solving, but there has got to be a different way to solve it. Matlab/R semantics have historically been deemed sufficient to solve algebraic problems, and they did not have a need for this. So neither should we.
    If nothing else, one can ultimately always exit to the RDD level and re-format the RDD content to whatever liking.
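
For readers following the thread, the two approaches under discussion can be compared in a toy model. This is only a sketch: `SparseRows`, `blank`, and `CardinalityDemo` are hypothetical names invented here, plain Scala collections stand in for the RDD, and nothing below is Mahout's DRM API. It assumes a sparse representation in which all-zero rows store nothing.

```scala
// Hypothetical, simplified stand-in for a sparse row-keyed matrix; not Mahout's DRM API.
final case class SparseRows(rows: Map[Int, Map[Int, Double]], nrow: Int, ncol: Int) {

  // Cardinality-only change, mirroring the idea in the diff: stored rows are untouched.
  def addToRowCardinality(n: Int): SparseRows = {
    require(n >= 0, "n must be non-negative")
    copy(nrow = nrow + n)
  }

  // R-style rbind: append the other matrix's rows below this one, shifting their keys.
  def rbind(other: SparseRows): SparseRows = {
    require(other.ncol == ncol, "column counts must match")
    val shifted = other.rows.map { case (k, v) => (k + nrow, v) }
    SparseRows(rows ++ shifted, nrow + other.nrow, ncol)
  }
}

object SparseRows {
  // An all-zero block of n rows stores nothing in this sparse model.
  def blank(n: Int, ncol: Int): SparseRows = SparseRows(Map.empty, n, ncol)
}

object CardinalityDemo {
  // A 3 x 2 matrix with two stored rows; row 1 is implicitly blank.
  val a: SparseRows = SparseRows(Map(0 -> Map(1 -> 2.0), 2 -> Map(0 -> 1.0)), nrow = 3, ncol = 2)

  val viaCardinality: SparseRows = a.addToRowCardinality(2)
  val viaRbind: SparseRows = a.rbind(SparseRows.blank(2, a.ncol))
}
```

In this toy model `viaCardinality` and `viaRbind` compare equal, which is one way to read the objection: rbind with a blank block already expresses the same result. Whether the two cost the same on a real Spark-backed DRM is not something this sketch can show.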

> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>                 Key: MAHOUT-1541
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
> Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MR code does, but it will also support reading externally defined IDs and flexible formats. Output will be in the legacy format or in text files of the user's specification with reattached item IDs.
> Support for legacy formats is an open question; users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM, so pipelining can be accomplished without writing to an actual file, and the legacy sequence file output may not be needed.
> Opinions?
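
The ID translation step described in the issue can be illustrated roughly as follows. This is a minimal sketch, not the proposed IndexedDataset: `IdDictionary`, `fromIds`, and `IdTranslationDemo` are hypothetical names, the (user ID, item ID) input shape is an assumption, and a pair of plain Scala collections stands in for a BiMap.

```scala
// Hypothetical helper, not Mahout's IndexedDataset: a two-way dictionary between
// external string IDs and the contiguous Int indices a matrix would use.
final class IdDictionary private (
    private val toIndex: Map[String, Int],
    private val toId: Vector[String]) {

  def size: Int = toId.size
  def indexOf(id: String): Option[Int] = toIndex.get(id)
  def idOf(index: Int): Option[String] = toId.lift(index)
}

object IdDictionary {
  // Assign indices in order of first appearance.
  def fromIds(ids: Iterable[String]): IdDictionary = {
    val distinct = ids.toVector.distinct
    new IdDictionary(distinct.zipWithIndex.toMap, distinct)
  }
}

object IdTranslationDemo {
  // Assumed input shape: (external user ID, external item ID) pairs.
  val interactions: Seq[(String, String)] =
    Seq(("u1", "iphone"), ("u2", "ipad"), ("u1", "ipad"))

  val users: IdDictionary = IdDictionary.fromIds(interactions.map(_._1))
  val items: IdDictionary = IdDictionary.fromIds(interactions.map(_._2))

  // On import: translate external IDs to (row, column) indices.
  val encoded: Seq[(Int, Int)] = interactions.flatMap { case (u, i) =>
    for { row <- users.indexOf(u); col <- items.indexOf(i) } yield (row, col)
  }

  // On output: reattach the external IDs.
  val decoded: Seq[(String, String)] = encoded.flatMap { case (r, c) =>
    for { u <- users.idOf(r); i <- items.idOf(c) } yield (u, i)
  }
}
```

Keeping both directions in one object means the output step can reattach external IDs without re-reading the input, which is the round trip the description asks for.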

This message was sent by Atlassian JIRA
