mahout-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis
Date Fri, 04 Jul 2014 22:44:34 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052701#comment-14052701 ]

Hudson commented on MAHOUT-1541:
--------------------------------

SUCCESS: Integrated in Mahout-Quality #2684 (See [https://builds.apache.org/job/Mahout-Quality/2684/])
MAHOUT-1541, MAHOUT-1568, MAHOUT-1569: fixed a build test problem; drivers now have an option
to not search for MAHOUT_HOME and SPARK_HOME (pat: rev 32badb1d360ddf514e6b253f2dea9ae7e5df078a)
* spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala


> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: New Feature
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an IndexedDataset with
> BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate
> params, then write output with external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MR code does, but it will also
> support reading externally defined IDs and flexible formats. Output will be in the legacy
> format or in text files of the user's specification, with item IDs reattached.
> Support for legacy formats is an open question; users can always use the legacy code if they
> want this. Internal to the IndexedDataset is a Spark DRM, so pipelining can be accomplished
> without writing to an actual file, and the legacy sequence file output may not be needed.
> Opinions?
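
The flow described in the issue (read external IDs, translate them to internal indices through bidirectional dictionaries, compute cooccurrences, then reattach the external IDs on output) can be sketched roughly as below. This is only an illustration in plain Python, not Mahout's API: the names `BiMap`, `index_of`, `id_of`, and `cooccurrence` are hypothetical, the comma-delimited `user,item` input format is an assumption, and raw pair counts stand in for whatever the Spark CooccurrenceAnalysis actually computes.

```python
from collections import defaultdict
from itertools import combinations


class BiMap:
    """Hypothetical bidirectional map: external string ID <-> dense integer index."""

    def __init__(self):
        self._to_index = {}  # external ID -> integer index
        self._to_id = []     # integer index -> external ID

    def index_of(self, external_id):
        # Assign the next free integer the first time an ID is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index):
        return self._to_id[index]


def cooccurrence(lines, delimiter=","):
    """Count how often two items share a user; return counts keyed by external item IDs."""
    users, items = BiMap(), BiMap()
    rows = defaultdict(set)  # user index -> set of item indices

    # Import step: translate external IDs to internal indices (assumed "user,item" lines).
    for line in lines:
        user, item = line.strip().split(delimiter)
        rows[users.index_of(user)].add(items.index_of(item))

    # Analysis step: plain pair counting, a stand-in for the real analysis.
    counts = defaultdict(int)
    for item_set in rows.values():
        for a, b in combinations(sorted(item_set), 2):
            counts[(a, b)] += 1

    # Output step: reattach the external item IDs.
    return {(items.id_of(a), items.id_of(b)): n for (a, b), n in counts.items()}


if __name__ == "__main__":
    sample = ["u1,iphone", "u1,ipad", "u2,iphone", "u2,ipad", "u3,ipad"]
    for pair, n in cooccurrence(sample).items():
        print(pair, n)
```

Keeping the dictionaries next to the data is what lets the output step run without a separate ID lookup file; the computation in between only ever sees integer indices.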



--
This message was sent by Atlassian JIRA
(v6.2#6252)
