spark-issues mailing list archives

From "Silas Davis (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
Date Thu, 20 Aug 2015 14:05:47 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704923#comment-14704923 ]

Silas Davis edited comment on SPARK-3533 at 8/20/15 2:05 PM:
-------------------------------------------------------------

[~nchammas] I don't have implementations for Python or Java, other than noting that they could
use the same OutputFormat to write multiple outputs through the current Spark API; however,
I'd be willing to try and put something together.

At this stage, though, I think it might be a bit premature for a PR, as what I wrote deliberately
works without changing existing Spark code, but I have a feeling that a more elegant solution
might be reached by transplanting some code from MultipleOutputsFormat into PairRDDFunctions.
Has anyone had a chance to grok what I'm doing in the gist? Would it be a good idea to parameterise
saveAsNewAPIHadoopDataset so that it can use MultipleOutputs directly? I might see if I can
work out something sensible along these lines and we can compare the approaches.
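To make the intended behaviour concrete, here is a pure-Python sketch of the semantics a generalised save-by-key would have: one output directory per key, with one part file per input partition. No Spark or Hadoop is involved; the function name and file layout are illustrative assumptions, not anything from the gist or Spark's API.

```python
# Pure-Python illustration of "multiple outputs" semantics: records from
# several partitions are written to one directory per key, with one
# part- file per partition. The name save_as_text_file_by_key and the
# part-file naming scheme are hypothetical, chosen to mirror the layout
# described in the ticket.
import os


def save_as_text_file_by_key(partitions, prefix):
    """Write each (key, value) pair to <prefix>/<key>/part-<n>,
    where n is the index of the partition the pair came from."""
    for n, partition in enumerate(partitions):
        # Group this partition's values by key, then write each group
        # to that key's part file for this partition index.
        by_key = {}
        for key, value in partition:
            by_key.setdefault(key, []).append(value)
        for key, values in by_key.items():
            out_dir = os.path.join(prefix, str(key))
            os.makedirs(out_dir, exist_ok=True)
            with open(os.path.join(out_dir, "part-%05d" % n), "w") as f:
                f.write("\n".join(values) + "\n")
```

A MultipleOutputs-based implementation inside Spark would produce this same directory layout, but driven by the OutputFormat machinery rather than plain file writes.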

Another thing: I've only just noticed that this ticket refers to saveAsTextFileByKey. In
my gist you can see I also have variants for saveAsMultipleAvroFiles and saveAsMultipleParquetFiles
using the same approach. We don't have to include these specific helpers, but I think we should
generalise this to multiple outputs for any OutputFormat. Can we expand this ticket, or should
I open a new one?

[~saurfang] With reference to what I've said above, I think it would be better to provide a
solution for any type of multiple outputs, not just text, as they lend themselves to a unified
approach and we might as well kill n birds with one stone. Also, Hadoop 2 has no equivalent
of MultipleTextOutputFormat, whereas Hadoop 1 does have a MultipleOutputs class which seems
largely similar to Hadoop 2's, so I think we can use an approach involving MultipleOutputs
for both Hadoop 1 and 2. So my personal opinion would be not to put together a PR based on
MultipleTextOutputFormat. I would welcome your assistance on a more general PR, and comments
on the approach.



> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
>                 Key: SPARK-3533
>                 URL: https://issues.apache.org/jira/browse/SPARK-3533
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once.
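The quoted keyBy example translates directly into plain Python, which makes the desired grouping easy to see; this is only an illustration of the keying semantics, not Spark code.

```python
# Plain-Python equivalent of the PySpark example above: pair each name
# with its first letter, as keyBy(lambda x: x[0]) does, then list the
# distinct keys that would each become an output directory.
names = ['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']
pairs = [(name[0], name) for name in names]
distinct_keys = sorted(set(k for k, _ in pairs))
```

Each entry of `distinct_keys` corresponds to one `/path/prefix/<key>` directory in the layout sketched above.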



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

