spark-issues mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-1133) Add a new small files input for MLlib, which will return an RDD[(fileName, content)]
Date Fri, 04 Apr 2014 18:14:16 GMT

     [ https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-1133.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

> Add a new small files input for MLlib, which will return an RDD[(fileName, content)]
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-1133
>                 URL: https://issues.apache.org/jira/browse/SPARK-1133
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Xusen Yin
>            Assignee: Xusen Yin
>            Priority: Minor
>              Labels: IO, MLLib, hadoop
>             Fix For: 1.0.0
>
>
> While writing an LDA (Latent Dirichlet Allocation) implementation for Spark MLlib, I found
> that an input API for small files would be useful, so I wrote smallTextFiles() to provide it.
> smallTextFiles() reads a directory of text files and returns an RDD[(String, String)]:
> the first String is the file name and the second is the content of that text file.
> smallTextFiles() can be used for local disk I/O or HDFS I/O, just like textFile()
> in SparkContext. For LDA, there are 2 common uses:
> 1. smallTextFiles() is used to preprocess local disk files, i.e. to combine them into
> one large file, which is then transferred to HDFS for further processing, such as LDA
> clustering.
> 2. It is also used to transfer the raw directory of small files to HDFS and then cluster
> it directly with LDA. This is not recommended, because it consumes too many namenode
> entries.
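
To make the (fileName, content) shape in the description above concrete, here is a minimal sketch in plain Python. It is not Spark code and not the code from the patch: the names small_text_files and combine are hypothetical, an in-memory list stands in for the RDD, and the tab-separated output format is an assumption chosen only to illustrate use 1 (combining many small files into one large file).

```python
import os


def small_text_files(directory):
    """Return (file_name, content) pairs for the regular files in `directory`.

    Hypothetical stand-in for the proposed API: a sorted in-memory list
    plays the role of the RDD[(String, String)].
    """
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip subdirectories
            with open(path, encoding="utf-8") as handle:
                pairs.append((name, handle.read()))
    return pairs


def combine(pairs, out_path):
    """Write one line per input file: file name, a tab, then the content.

    Whitespace runs (including newlines) in the content are collapsed to
    single spaces so that each document stays on one line. This format is
    an assumption for illustration only.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for name, content in pairs:
            out.write(name + "\t" + " ".join(content.split()) + "\n")
```

With a layout like this, the single combined file can be copied to HDFS and read line by line, which avoids storing one namenode entry per small document.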



--
This message was sent by Atlassian JIRA
(v6.2#6252)
