spark-dev mailing list archives

From Alex Angelini <>
Subject Re: Writing to multiple outputs in Spark
Date Fri, 14 Aug 2015 15:04:28 GMT
Speaking for Shopify's deployment, this would be a really nice feature to have.

We would like to write data to folders with the structure
`<year>/<month>/<day>` but have had to hold off on that because of the lack
of support for MultipleOutputs.
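The desired layout can be sketched in plain Python (the records and paths below are illustrative, not Shopify's actual pipeline; a real job would write from Spark tasks rather than a single process):

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical (date, payload) records; in practice these would come from an RDD.
records = [
    ("2015-08-14", "order-1"),
    ("2015-08-14", "order-2"),
    ("2015-07-01", "order-3"),
]

base = tempfile.mkdtemp()

# Bucket each record under <base>/<year>/<month>/<day>.
buckets = defaultdict(list)
for date, payload in records:
    year, month, day = date.split("-")
    buckets[os.path.join(base, year, month, day)].append(payload)

# Write one part file per date directory.
for path, rows in buckets.items():
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "part-00000"), "w") as f:
        f.write("\n".join(rows) + "\n")
```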

On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis <> wrote:

> Would it be right to assume that the silence on this topic implies others
> don't really have this issue/desire?
> On Sat, 18 Jul 2015 at 17:24 Silas Davis <> wrote:
>> *tl;dr: hadoop and cascading provide ways of writing tuples to multiple
>> output files based on key, but the plain RDD interface doesn't seem to, and
>> it should.*
>> I have been looking into ways to write to multiple outputs in Spark. It
>> seems like a feature that is somewhat missing from Spark.
>> The idea is to partition the output and write the elements of an RDD to
>> different locations depending on the key. For example, in a pair RDD your
>> key may be (language, date, userId) and you would like to write separate
>> files to $someBasePath/$language/$date. Then there would be a version of
>> saveAsHadoopDataset that would be able to write to multiple locations
>> based on the key using the underlying OutputFormat. Perhaps it would take
>> a pair RDD with keys ($partitionKey, $realKey), so for example ((language,
>> date), userId).
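For illustration, the proposed ((partitionKey, realKey), value) routing could behave like this single-process Python sketch (the function name and record shapes are my own invention, not an existing Spark API):

```python
import os
import tempfile

def save_by_partition(pairs, base_path):
    # Sketch only: route each ((partitionKey, realKey), value) record to a
    # directory derived from the partition key, e.g. $base/$language/$date.
    handles = {}
    try:
        for (partition_key, real_key), value in pairs:
            out_dir = os.path.join(base_path, *partition_key)
            if out_dir not in handles:
                os.makedirs(out_dir, exist_ok=True)
                handles[out_dir] = open(os.path.join(out_dir, "part-00000"), "w")
            handles[out_dir].write(f"{real_key}\t{value}\n")
    finally:
        for h in handles.values():
            h.close()

base = tempfile.mkdtemp()
save_by_partition([
    ((("en", "2015-08-14"), "user1"), "click"),
    ((("en", "2015-08-14"), "user2"), "view"),
    ((("fr", "2015-08-14"), "user3"), "click"),
], base)
```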
>> The prior art I have found on this is the following.
>> Using SparkSQL:
>> The 'partitionBy' method of DataFrameWriter:
>> This only works for parquet at the moment.
>> Using Spark/Hadoop:
>> This pull request (with the hadoop1 API):
>> This uses MultipleTextOutputFormat (which in turn uses
>> MultipleOutputFormat), which is part of the old hadoop1 API. It only works
>> for text but could be generalised to any underlying OutputFormat by using
>> MultipleOutputFormat (but only for hadoop1, which doesn't support
>> ParquetAvroOutputFormat, for example).
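The core of that generalisation is MultipleOutputFormat's generateFileNameForKeyValue hook, which maps a key to a per-key output path. A plain-Python analogue (a (language, date) key shape is assumed purely for illustration):

```python
import os

def generate_file_name_for_key_value(key, value, leaf_name):
    # Analogue of MultipleTextOutputFormat.generateFileNameForKeyValue:
    # derive a per-key output directory while keeping the per-task leaf
    # file name (e.g. "part-00000") so concurrent writers don't collide.
    language, date = key  # assumed key shape for this example
    return os.path.join(language, date, leaf_name)

path = generate_file_name_for_key_value(("en", "2015-08-14"), None, "part-00000")
```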
>> This gist (with the hadoop2 API):
>> This uses MultipleOutputs (available for both the old and new hadoop
>> APIs) and extends saveAsNewAPIHadoopDataset to support multiple outputs.
>> It should work for any underlying OutputFormat. It would probably be
>> better implemented by extending saveAs[NewAPI]HadoopDataset.
>> In Cascading:
>> Cascading provides PartitionTap to do this:
>> So my questions are: is there a reason why Spark doesn't provide this?
>> Does Spark provide similar functionality through some other mechanism? How
>> would it be best implemented?
>> Since I started composing this message I've had a go at writing a
>> wrapper OutputFormat that writes multiple outputs using Hadoop's
>> MultipleOutputs but doesn't require modification of PairRDDFunctions.
>> The principle is similar, however. Again, it feels slightly hacky to use
>> dummy fields for the ReduceContextImpl, but some of this may be part of
>> the impedance mismatch between Spark and plain Hadoop... Here is my
>> attempt:
>> I'd like to see this functionality in Spark somehow but invite suggestion
>> of how best to achieve it.
>> Thanks,
>> Silas
