mahout-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis
Date Thu, 19 Jun 2014 19:34:24 GMT


ASF GitHub Bot commented on MAHOUT-1541:

Github user dlyubimov commented on the pull request:
    General note: a lot of style problems. 
    Code lines are not to exceed 120 characters (I think I saw some suspiciously long ones). 
    There is a definite lack of comments. 
    FYI, the Spark comment style is the following: every comment starts with a capital letter and
is formatted to cut off at the 100th character. They are very draconian about it, and that's
what I followed here as well. 
    The 100th-character cutoff is questionable since it cannot be auto-formatted in IDEA, but
I do believe comments need some justification applied on the right. 
    For closure stacks, I'd suggest the following comment style:

        val b = A
          // I want to map.
          .map { tuple =>
            ...
          }
          // I want to filter.
          .filter { tuple =>
            ...
          }
    So it would be useful to state in plain words what each closure is to accomplish; since closures
are functional units, i.e. function-grade citizens, they deserve, imo, some explanation.
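
A minimal runnable sketch of that commenting style, using a plain Scala Seq in place of a
Mahout/Spark structure (the tuples and predicates here are made up purely for illustration):

```scala
object ClosureCommentStyle extends App {
  // Sample data standing in for whatever collection A would be.
  val a = Seq((1, "apple"), (2, "banana"), (3, "cherry"))

  val b = a
    // Keep only the tuples whose integer key is odd.
    .filter { case (k, _) => k % 2 == 1 }
    // Map each surviving tuple to its upper-cased string value.
    .map { case (_, v) => v.toUpperCase }

  println(b) // List(APPLE, CHERRY)
}
```

Each chained closure gets a one-line comment stating its intent, placed directly above the
`.map`/`.filter` call it describes.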

> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>                 Key: MAHOUT-1541
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
> Create a CLI driver to import data in a flexible manner, create an IndexedDataset with
BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate
params, then write output with external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MR code does, but it will also support
reading externally defined IDs and flexible formats. Output will be in the legacy format or in text
files of the user's specification with Item IDs reattached. 
> Whether to support legacy formats is an open question; users can always use the legacy code if
they want that. Internal to the IndexedDataset is a Spark DRM, so pipelining can be accomplished
without writing to an actual file, and the legacy sequence-file output may not be needed.
> Opinions?
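
To illustrate the ID translation the description refers to, here is a hypothetical Scala sketch of a
BiMap-style dictionary: external string IDs get dense integer indices on first sight, and the reverse
map lets output reattach the original external IDs. The `forward`/`reverse` maps and `index` helper
are illustrative only, not Mahout's actual IndexedDataset API:

```scala
import scala.collection.mutable

object BiMapSketch extends App {
  // Forward dictionary: external ID -> internal dense index, in insertion order.
  val forward = mutable.LinkedHashMap[String, Int]()

  // Assign the next dense index to an unseen external ID; reuse it otherwise.
  def index(id: String): Int = forward.getOrElseUpdate(id, forward.size)

  // Translate external row IDs to internal indices for the DRM.
  val rows = Seq("user-a", "user-b", "user-a").map(index)

  // Reverse dictionary: internal index -> external ID, used when writing output.
  val reverse = forward.map(_.swap)
  val externalAgain = rows.map(reverse)

  println(rows)          // List(0, 1, 0)
  println(externalAgain) // List(user-a, user-b, user-a)
}
```

The same pair of dictionaries would be kept for columns (item IDs), so the cooccurrence output can
be written with either internal indices or the user's original IDs.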

This message was sent by Atlassian JIRA
