spark-user mailing list archives

From Andre Schumacher <schum...@icsi.berkeley.edu>
Subject Re: Save RDDs as CSV
Date Thu, 31 Oct 2013 05:55:02 GMT

There is also the getmerge command of the HDFS shell which lets you
merge and fetch the contents of a directory, which may be exactly what
you want. From the docs:

Usage: hdfs dfs -getmerge <src> <localdst> [addnl]

Takes a source directory and a destination file as input and
concatenates files in src into the destination local file. Optionally
addnl can be set to enable adding a newline character at the end of each
file.
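
Putting the suggestions from this thread together, here is a minimal sketch in Python (PySpark), not a definitive recipe. The helper names to_csv_line, save_as_csv and getmerge_command are made up for illustration, and it assumes each RDD element is a list or tuple of fields. The write happens in parallel and yields a directory of part files, which getmerge can then concatenate into a single local file.

```python
import csv
import io


def to_csv_line(record):
    """Format one record (a sequence of fields) as a single CSV line.

    Hypothetical helper. The csv module takes care of quoting fields
    that themselves contain commas or quotes.
    """
    buf = io.StringIO()
    # No line terminator here: saveAsTextFile() writes one record per line.
    csv.writer(buf, lineterminator="").writerow(record)
    return buf.getvalue()


def save_as_csv(rdd, path):
    """Write the RDD in parallel; `path` ends up as a directory of part files.

    Hypothetical helper; `rdd` is assumed to be an RDD of field sequences.
    """
    rdd.map(to_csv_line).saveAsTextFile(path)


def getmerge_command(src_dir, local_dst):
    """Build the argument list for merging the part files into one local file."""
    return ["hdfs", "dfs", "-getmerge", src_dir, local_dst]
```

After save_as_csv(myrdd, path) has finished, something like subprocess.check_call(getmerge_command(path, "my.csv")) should fetch the merged result, assuming the hdfs command is on the PATH of the machine you run it from.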

On 10/30/2013 09:07 PM, Patrick Wendell wrote:
> You can do this if you coalesce the data first. However, this will
> put all of your final data through a single reduce task (so you get
> no parallelism and may overload a node):
> 
> myrdd.coalesce(1).saveAsTextFile("hdfs://..../my.csv")
> 
> Basically you have to choose: either you do the write in parallel and
> get a lot of files, or you do the write on one node/reducer and get a
> single file.
> 
> - Patrick
> 
> On Wed, Oct 30, 2013 at 8:05 PM, Shay Seng <shay@1618labs.com> wrote:
>> Well that almost works... when I call
>> myrdd.saveAsTextFile("hdfs://..../my.csv")
>>
>> Instead of getting a single my.csv file, as I expect, my.csv is a directory
>> with a bunch of parts - all of which are csv.
>> Is there some way to have those files concatenated automatically?
>>
>> On Wed, Oct 30, 2013 at 7:13 PM, Josh Rosen <rosenville@gmail.com> wrote:
>>>
>>> saveAsTextFile() is implemented in terms of Hadoop's TextOutputFormat,
>>> which writes one record per line:
>>> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L816
>>>
>>> You could map() each entry in your RDD into a comma-separated string, then
>>> write those strings using saveAsTextFile().
>>>
>>>
>>>
>>>
>>> On Wed, Oct 30, 2013 at 7:10 PM, Andre Schumacher
>>> <schumach@icsi.berkeley.edu> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Can you use saveAsTextFile? See
>>>>
>>>>
>>>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
>>>>
>>>> I'm not sure what the default field separator is (Tab probably) but if
>>>> you don't mind that may work? No need to collect it to the master.
>>>>
>>>> Andre
>>>>
>>>> On 10/30/2013 06:34 PM, Shay Seng wrote:
>>>>> What's the recommended way to save an RDD as a CSV on, say, HDFS?
>>>>> Do I have to collect the RDD and save it from the master, or is there
>>>>> some way I can write out the CSV file in parallel to HDFS?
>>>>>
>>>>>
>>>>> tks
>>>>> shay
>>>>>
>>>>
>>>
>>

