mahout-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
Date Fri, 26 Sep 2014 18:22:34 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149760#comment-14149760 ]

ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-57000893
  
    Let's not pile all things together. DRM is DRM and sequence file is sequence file (not
DRM). 
    
    There is such a thing as DRM persistence. Since Hadoop times and to date, such persistence on (H)DFS has been defined only via sequence file. So saving to HDFS can only mean one thing in order for data to stay a DRM.
    
    Corollaries to that are a few things: (1) not every sequence file is a DRM; only a sequence file with o.a.m.VectorWritable as its value is. (2) DRM data saved to anything but a sequence file cannot be a DRM.
    
    That said, custom input/output adapters are possible. But I am against making no distinction between text and sequence files, as one continues to be a DRM while the other is just a bunch of comma-separated numbers.


> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form <Text,VectorWritable>, SparkEngine's drmFromHDFS method is creating RDDs with the same Key for all Pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...} 
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:

> {code}
>  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>         // Get rid of VectorWritable
>         .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>   
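
One plausible explanation for the symptom above, not confirmed anywhere in this message, is that Hadoop record readers reuse a single Writable instance per field, so holding on to the key object itself leaves every pair pointing at whatever key was read last. The sketch below reproduces that aliasing effect without Hadoop or Spark. `MutableKey` and `readRecords` are hypothetical stand-ins, and the first two sample keys are made up for illustration.

```scala
object KeyReuseSketch {

  // Hypothetical stand-in for a mutable Hadoop Writable key such as Text.
  final class MutableKey(var value: String) {
    override def toString: String = value
  }

  // Hypothetical reader that reuses ONE key instance and overwrites it
  // for every record (the assumed behaviour being illustrated).
  def readRecords(keys: Seq[String]): Iterator[(MutableKey, Int)] = {
    val shared = new MutableKey("")
    keys.iterator.zipWithIndex.map { case (k, i) =>
      shared.value = k
      (shared, i)
    }
  }

  // Only the last key comes from the report; the others are invented.
  val input: Seq[String] =
    Seq("/alt.atheism/1", "/sci.space/2", "/talk.religion.misc/84570")

  // Keeping references to the reused object: once all records have been read,
  // every pair shows the last key.
  val aliased: List[String] =
    readRecords(input).toList.map(_._1.toString)

  // Copying the key into an immutable String while its record is current
  // preserves the distinct keys.
  val copied: List[String] =
    readRecords(input).map { case (k, v) => (k.toString, v) }.toList.map(_._1)
}
```

Under that assumption, the analogous change in the quoted snippet would be to convert or copy `t._1` inside the `.map` rather than passing the Writable through unchanged. Whether that matches the eventual fix is not stated in this message.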



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
