spark-user mailing list archives

From Corey Nolet <cjno...@gmail.com>
Subject Re: Partition + equivalent of MapReduce multiple outputs
Date Thu, 29 Jan 2015 02:16:33 GMT
I think this repartitionAndSortWithinPartitions() method may be what I'm
looking for in [1]. At least it sounds like it is. Will this method allow
me to deal with sorted partitions even when the partition doesn't fit into
memory?

[1]
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala
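
For readers of this thread, here is a minimal plain-Python sketch (not Spark code) of the general technique that makes sorted output possible when a partition is larger than memory: sort bounded chunks, spill each sorted run to disk, then stream a k-way merge. Because the merge is ordered by (group, key), each group's records arrive contiguously and can be written to that group's folder one at a time. The names external_sort, write_groups, max_in_memory, the pickle run format and the "part-00000" file name are all made up for illustration. Whether Spark 1.2's shuffle behaves exactly this way is not confirmed in this thread.

```python
import heapq
import itertools
import os
import pickle


def _write_run(sorted_records, run_dir, index):
    """Spill one already-sorted chunk to disk and return its path."""
    path = os.path.join(run_dir, f"run-{index:05d}.pkl")
    with open(path, "wb") as out:
        for record in sorted_records:
            pickle.dump(record, out)
    return path


def _read_run(path):
    """Lazily yield the records of one spilled run, one at a time."""
    with open(path, "rb") as src:
        while True:
            try:
                yield pickle.load(src)
            except EOFError:
                return


def external_sort(records, max_in_memory, run_dir):
    """Return an iterator over (group, key, value) records sorted by (group, key).

    At most max_in_memory records are held in a list at once; the final
    merge keeps only one record per spilled run in memory.
    """
    run_paths = []
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, max_in_memory))
        if not chunk:
            break
        chunk.sort(key=lambda r: (r[0], r[1]))
        run_paths.append(_write_run(chunk, run_dir, len(run_paths)))
    runs = [_read_run(p) for p in run_paths]
    return heapq.merge(*runs, key=lambda r: (r[0], r[1]))


def write_groups(sorted_records, out_root):
    """Stream sorted records into one folder per group; return row counts."""
    written = {}
    for group, rows in itertools.groupby(sorted_records, key=lambda r: r[0]):
        folder = os.path.join(out_root, str(group))
        os.makedirs(folder, exist_ok=True)
        count = 0
        with open(os.path.join(folder, "part-00000"), "w") as out:
            for _, key, value in rows:
                out.write(f"{key}\t{value}\n")
                count += 1
        written[group] = count
    return written
```

Calling write_groups(external_sort(records, 100000, run_dir), out_root) with an existing run_dir would leave one folder per group under out_root, each holding its rows in key order. The sketch uses a single output file per group for simplicity; a real job would more likely produce several part files per folder.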

On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet <cjnolet@gmail.com> wrote:

> I'm looking at the ShuffledRDD code and it looks like there is a method
> setKeyOrdering(). Is this guaranteed to order everything in the partition?
> I'm on Spark 1.2.0.
>
> On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet <cjnolet@gmail.com> wrote:
>
>> In all of the solutions I've found thus far, sorting has been done by
>> copying the partition iterator into an array and sorting the array. This
>> is not going to work for my case, as the amount of data in each partition
>> may not necessarily fit into memory. Any ideas?
>>
>> On Wed, Jan 28, 2015 at 1:29 AM, Corey Nolet <cjnolet@gmail.com> wrote:
>>
>>> I wanted to update this thread for others who may be looking for a
>>> solution to this as well. I found [1] and I'm going to investigate whether
>>> it is a viable solution.
>>>
>>> [1]
>>> http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
>>>
>>> On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet <cjnolet@gmail.com> wrote:
>>>
>>>> I need to be able to take an input RDD[Map[String,Any]] and split it
>>>> into several different RDDs based on some partitionable piece of the key
>>>> (groups) and then send each partition to a separate set of files in
>>>> different folders in HDFS.
>>>>
>>>> 1) Would running the RDD through a custom partitioner be the best way
>>>> to go about this, or should I split the RDD into different RDDs and call
>>>> saveAsHadoopFile() on each?
>>>> 2) I need the resulting partitions sorted by key; they also need to be
>>>> written to the underlying files in sorted order.
>>>> 3) The number of keys in each partition will almost always be too big
>>>> to fit into memory.
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>
