spark-user mailing list archives

From Shay Seng <s...@1618labs.com>
Subject Re: Save RDDs as CSV
Date Thu, 31 Oct 2013 04:16:54 GMT
Doing a coalesce will be kind of a problem... I was hoping there would be a
utility or command option that could concatenate all the files together for
me...

Thanks for the replies though!
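
For part files that have already been copied to a local directory, the
concatenation asked for above can be sketched in a few lines of Python. This
is a minimal sketch, not something from the thread: the function name
merge_part_files and both of its parameters are hypothetical, and it assumes
the usual zero-padded part-NNNNN file names so that sorting by name preserves
partition order. For data that is still on HDFS, the Hadoop shell command
"hadoop fs -getmerge" serves a similar purpose by merging a directory into a
single local file.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, merged_path):
    """Concatenate the part-* files in part_dir into merged_path.

    Hypothetical helper: assumes a local directory and zero-padded
    part-NNNNN names, so lexicographic order equals partition order.
    Returns the number of part files that were merged.
    """
    # "part-*" skips marker files such as _SUCCESS and hidden .crc files.
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream each part so large files are never held in memory.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

The merge runs on a single machine after the parallel write has finished, so
the Spark job itself keeps its parallelism; only the final copy is serial.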



On Wed, Oct 30, 2013 at 9:07 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> You can do this if you coalesce the data first. However, this will
> put all of your final data through a single reduce task (so you get
> no parallelism and may overload a node):
>
> myrdd.coalesce(1).saveAsTextFile("hdfs://..../my.csv")
>
> Basically you have to choose: either you do the write in parallel and
> get a lot of files, or you do the write on one node/reducer and get a
> single file.
>
> - Patrick
>
> On Wed, Oct 30, 2013 at 8:05 PM, Shay Seng <shay@1618labs.com> wrote:
> > Well that almost works... when I call
> > myrdd.saveAsTextFile("hdfs://..../my.csv")
> >
> > Instead of getting a single my.csv file, as I expected, my.csv is a
> > directory with a bunch of part files, all of which are CSV.
> > Is there some way to have those files concatenated automatically?
> >
> >
> >
> >
> > On Wed, Oct 30, 2013 at 7:13 PM, Josh Rosen <rosenville@gmail.com> wrote:
> >>
> >> saveAsTextFile() is implemented in terms of Hadoop's TextOutputFormat,
> >> which writes one record per line:
> >>
> >> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L816
> >>
> >> You could map() each entry in your RDD into a comma-separated string,
> >> then write those strings using saveAsTextFile().
> >>
> >>
> >>
> >>
> >> On Wed, Oct 30, 2013 at 7:10 PM, Andre Schumacher
> >> <schumach@icsi.berkeley.edu> wrote:
> >>>
> >>>
> >>> Hi,
> >>>
> >>> Can you use saveAsTextFile? See
> >>>
> >>>
> >>>
> >>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
> >>>
> >>> I'm not sure what the default field separator is (Tab probably) but if
> >>> you don't mind that may work? No need to collect it to the master.
> >>>
> >>> Andre
> >>>
> >>> On 10/30/2013 06:34 PM, Shay Seng wrote:
> >>> > What's the recommended way to save a RDD as a CSV on say HDFS?
> >>> > Do I have to collect the RDD and save it from the master, or is there
> >>> > some way I can write out the CSV file in parallel to HDFS?
> >>> >
> >>> >
> >>> > tks
> >>> > shay
> >>> >
> >>>
> >>
> >
>
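
The map() step suggested in the thread, turning each record into one
comma-separated string before calling saveAsTextFile(), could look roughly
like the following. This is a sketch under stated assumptions rather than
anything posted in the thread: the helper name to_csv_line is hypothetical,
each record is assumed to be a sequence of fields, and Python's standard csv
module is used so that fields containing commas or quotes are escaped instead
of being joined naively.

```python
import csv
import io


def to_csv_line(fields):
    """Format one record (a sequence of fields) as a single CSV line.

    Hypothetical helper: returns the line without a trailing newline,
    because the text output format adds one line break per record.
    """
    buffer = io.StringIO()
    # The default dialect quotes fields that contain commas, quotes or
    # line breaks, and doubles any embedded quote characters.
    csv.writer(buffer).writerow(fields)
    # Strip only the writer's own "\r\n" row terminator.
    return buffer.getvalue().rstrip("\r\n")


# Assumed usage with an RDD of tuples (not run here, requires Spark):
#   myrdd.map(to_csv_line).saveAsTextFile("hdfs://..../my.csv")
```

A field that itself contains a line break is still quoted correctly, but it
will span more than one physical line in the output, which matters if the
part files are later read back one line at a time.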
