spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Writing to multiple outputs in Spark
Date Fri, 14 Aug 2015 20:09:38 GMT
This is already supported with the new partitioned data sources in
DataFrame/SQL right?
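For example (a minimal sketch, assuming a DataFrame with `language` and `date` columns; paths and column names are illustrative):

```scala
// Partitioned write with the DataFrame API (Spark 1.4+).
// `events.json`, `language`, and `date` are hypothetical.
val df = sqlContext.read.json("events.json")
df.write
  .partitionBy("language", "date")
  .parquet("/some/base/path")
// Lays out output as /some/base/path/language=.../date=.../part-*
```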


On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini <alex.angelini@shopify.com>
wrote:

> Speaking about Shopify's deployment, this would be a really nice feature
> to have.
>
> We would like to write data to folders with the structure
> `<year>/<month>/<day>` but have had to hold off on that because of
the lack
> of support for MultipleOutputs.
>
> On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis <silas@silasdavis.net>
> wrote:
>
>> Would it be right to assume that the silence on this topic implies others
>> don't really have this issue/desire?
>>
>> On Sat, 18 Jul 2015 at 17:24 Silas Davis <silas@silasdavis.net> wrote:
>>
>>> *tl;dr: hadoop and cascading provide ways of writing tuples to
>>> multiple output files based on key, but the plain RDD interface doesn't
>>> seem to, and it should.*
>>>
>>> I have been looking into ways to write to multiple outputs in Spark. It
>>> seems like a feature that is somewhat missing from Spark.
>>>
>>> The idea is to partition output and write the elements of an RDD to
>>> different locations based on the key. For example, in a pair RDD
>>> your key may be (language, date, userId) and you would like to write
>>> separate files to $someBasePath/$language/$date. There would then be a
>>> version of saveAsHadoopDataset able to write to multiple locations
>>> based on key using the underlying OutputFormat. Perhaps it would take a
>>> pair RDD with keys ($partitionKey, $realKey), so for example ((language,
>>> date), userId).
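>>>
>>> To make the shape concrete, here is a naive stand-in using only the
>>> existing RDD API (everything below is illustrative; `someBasePath` and
>>> the sample data are made up). It runs one job per distinct partition
>>> key, which is exactly the inefficiency a single-pass
>>> saveAsHadoopDataset variant would avoid:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("multi-output-sketch"))
val someBasePath = "/tmp/out"  // illustrative

// ((language, date), userId) pairs, per the shape described above
val events = sc.parallelize(Seq(
  (("en", "2015-08-14"), "user1"),
  (("fr", "2015-08-14"), "user2")))

// Naive approach: filter and save once per distinct partition key.
// Correct, but O(#keys) passes over the data.
for ((language, date) <- events.keys.distinct.collect()) {
  events
    .filter { case (k, _) => k == (language, date) }
    .values
    .saveAsTextFile(s"$someBasePath/$language/$date")
}
```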
>>>
>>> The prior art I have found on this is the following.
>>>
>>> Using SparkSQL:
>>> The 'partitionBy' method of DataFrameWriter:
>>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>>
>>> This only works for parquet at the moment.
>>>
>>> Using Spark/Hadoop:
>>> This pull request (with the hadoop1 API):
>>> https://github.com/apache/spark/pull/4895/files.
>>>
>>> This uses MultipleTextOutputFormat (which in turn uses
>>> MultipleOutputFormat) which is part of the old hadoop1 API. It only works
>>> for text but could be generalised for any underlying OutputFormat by using
>>> MultipleOutputFormat (but only for hadoop1 - which doesn't support
>>> ParquetAvroOutputFormat for example)
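>>>
>>> For reference, a minimal sketch of that hadoop1 approach (the class
>>> name and the key-to-path layout are illustrative):

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// hadoop1 approach: route each record to a file derived from its key.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[String, String] {
  // e.g. key "en/2015-08-14" -> <outputPath>/en/2015-08-14/<name>
  override def generateFileNameForKeyValue(key: String, value: String,
                                           name: String): String =
    s"$key/$name"

  // Drop the key from the written record, keeping only the value
  override def generateActualKey(key: String, value: String): String = null
}

// Usage:
// rdd.saveAsHadoopFile("/some/base/path", classOf[String], classOf[String],
//                      classOf[KeyBasedOutputFormat])
```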
>>>
>>> This gist (With the hadoop2 API):
>>> https://gist.github.com/mlehman/df9546f6be2e362bbad2
>>>
>>> This uses MultipleOutputs (available for both the old and new hadoop
>>> APIs) and extends saveAsNewHadoopDataset to support multiple outputs.
>>> Should work for any underlying OutputFormat. Probably better implemented by
>>> extending saveAs[NewAPI]HadoopDataset.
>>>
>>> In Cascading:
>>> Cascading provides PartitionTap:
>>> http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
>>> to do this.
>>>
>>> So my questions are: is there a reason why Spark doesn't provide this?
>>> Does Spark provide similar functionality through some other mechanism? How
>>> would it be best implemented?
>>>
>>> Since I started composing this message I've had a go at writing a
>>> wrapper OutputFormat that writes multiple outputs using hadoop
>>> MultipleOutputs but doesn't require modification of the PairRDDFunctions.
>>> The principle is similar, however. Again, it feels slightly hacky to use
>>> dummy fields for the ReduceContextImpl, but some of this may be a part of
>>> the impedance mismatch between Spark and plain Hadoop... Here is my
>>> attempt: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
>>>
>>> I'd like to see this functionality in Spark somehow, but I invite
>>> suggestions on how best to achieve it.
>>>
>>> Thanks,
>>> Silas
>>>
>>
>
