spark-dev mailing list archives

From Reynold Xin <>
Subject Re: Writing to multiple outputs in Spark
Date Fri, 14 Aug 2015 20:09:38 GMT
This is already supported with the new partitioned data sources in
DataFrame/SQL right?
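[For reference, the partitioned write Reynold mentions looks roughly like the sketch below (Spark 1.4+ `DataFrameWriter.partitionBy`; the column names and base path are assumptions for illustration). `partitionBy` produces a Hive-style directory layout, spelled out here by a small helper:

```scala
// Assumed usage of DataFrameWriter.partitionBy (Spark 1.4+); `df` and the
// column names are illustrative:
//
//   df.write.partitionBy("language", "date").parquet("/someBasePath")
//
// partitionBy lays files out in Hive-style key=value directories. This
// helper just spells out that layout for one record's partition values:
def hivePartitionDir(base: String, cols: Seq[(String, String)]): String =
  cols.map { case (k, v) => s"$k=$v" }.mkString(base + "/", "/", "")
```

Note the `key=value` directory names differ from the bare `$language/$date` layout requested downthread.]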

On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini <> wrote:

> Speaking about Shopify's deployment, this would be a really nice feature
> to have.
> We would like to write data to folders with the structure
> `<year>/<month>/<day>`, but we have had to hold off on that because of
> the lack of support for MultipleOutputs.
> On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis <>
> wrote:
>> Would it be right to assume that the silence on this topic implies others
>> don't really have this issue/desire?
>> On Sat, 18 Jul 2015 at 17:24 Silas Davis <> wrote:
>>> *tl;dr: hadoop and cascading provide ways of writing tuples to
>>> multiple output files based on key, but the plain RDD interface doesn't
>>> seem to, and it should.*
>>> I have been looking into ways to write to multiple outputs in Spark. It
>>> seems like a feature that is somewhat missing from Spark.
>>> The idea is to partition output and write the elements of an RDD to
>>> different locations depending on the key. For example, in a pair RDD
>>> your key may be (language, date, userId) and you would like to write
>>> separate files to $someBasePath/$language/$date. Then there would be a
>>> version of saveAsHadoopDataset that would be able to write to multiple
>>> locations based on the key using the underlying OutputFormat. Perhaps
>>> it would take a pair RDD with keys ($partitionKey, $realKey), so for
>>> example ((language, date), userId).
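[The routing the proposal describes could be sketched as a plain function over the composite key; the name and shape below are hypothetical, not an existing Spark API:

```scala
// Hypothetical helper: given the proposed partition key ((language, date))
// of a pair RDD keyed by ((language, date), userId), compute the directory
// a record should land in under some base path.
def outputDirFor(basePath: String, partitionKey: (String, String)): String = {
  val (language, date) = partitionKey
  s"$basePath/$language/$date"
}
```
]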
>>> The prior art I have found on this is the following.
>>> Using SparkSQL:
>>> The 'partitionBy' method of DataFrameWriter:
>>> This only works for parquet at the moment.
>>> Using Spark/Hadoop:
>>> This pull request (with the hadoop1 API):
>>> This uses MultipleTextOutputFormat (which in turn uses
>>> MultipleOutputFormat) which is part of the old hadoop1 API. It only works
>>> for text but could be generalised for any underlying OutputFormat by using
>>> MultipleOutputFormat (but only for hadoop1, which doesn't support
>>> ParquetAvroOutputFormat, for example).
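[The hook that approach relies on is `MultipleTextOutputFormat.generateFileNameForKeyValue` (in `org.apache.hadoop.mapred.lib`, the hadoop1 API), which maps each record to an output file name. Its routing logic amounts to the following, modeled here as a plain function so it runs without Hadoop on the classpath; a real subclass would override the method (which also receives the record's value and the default part-file leaf name) with the same body:

```scala
// Modeled after MultipleTextOutputFormat.generateFileNameForKeyValue
// (hadoop1 API): route each record to a file path derived from its key.
// `leafName` stands in for the default part-file name Hadoop passes in.
def generateFileNameForKeyValue(key: String, value: String, leafName: String): String =
  s"$key/$leafName"
```
]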
>>> This gist (with the hadoop2 API):
>>> This uses MultipleOutputs (available for both the old and new hadoop
>>> APIs) and extends saveAsNewHadoopDataset to support multiple outputs.
>>> Should work for any underlying OutputFormat. Probably better implemented by
>>> extending saveAs[NewAPI]HadoopDataset.
>>> In Cascading:
>>> Cascading provides PartitionTap to do this.
>>> So my questions are: is there a reason why Spark doesn't provide this?
>>> Does Spark provide similar functionality through some other mechanism? How
>>> would it be best implemented?
>>> Since I started composing this message I've had a go at writing a
>>> wrapper OutputFormat that writes multiple outputs using hadoop
>>> MultipleOutputs but doesn't require modification of the PairRDDFunctions.
>>> The principle is similar however. Again it feels slightly hacky to use
>>> dummy fields for the ReduceContextImpl, but some of this may be a part of
>>> the impedance mismatch between Spark and plain Hadoop... Here is my
>>> attempt:
>>> I'd like to see this functionality in Spark somehow but invite
>>> suggestion of how best to achieve it.
>>> Thanks,
>>> Silas
