spark-user mailing list archives

From Krishna Pisupat <krishna.pisu...@gmail.com>
Subject Re: How to split one big RDD into several small ones
Date Wed, 11 Sep 2013 08:15:19 GMT
I don't think there is a direct way. Have you looked at using partitions to achieve
this? All the elements that satisfy a filter would belong to one partition.
Look at PartitionPruningRDD:
http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/rdd/PartitionPruningRDD.html.
Maybe it could help you achieve what you are trying to do.
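A rough sketch of the partition-based idea in Spark's Scala API. This is an assumption-laden illustration, not a tested recipe: `classify`, `numClasses`, and the input/output paths are hypothetical placeholders, and the exact PartitionPruningRDD constructor/factory differs across Spark versions.

```scala
import org.apache.spark.{SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._          // implicit pair-RDD functions
import org.apache.spark.rdd.PartitionPruningRDD

val numClasses = 4                              // hypothetical number of sub-datasets

// Hypothetical classifier: map each record to the id (0 .. numClasses-1)
// of the one sub-dataset it belongs to.
def classify(line: String): Int = line.hashCode.abs % numClasses

val sc  = new SparkContext("local", "split-example")
val big = sc.textFile("hdfs://.../big-input")   // placeholder path

// One shuffle pass: key each record by its class and hash-partition so that
// all records of class i land in partition i.
val byClass = big.map(r => (classify(r), r))
                 .partitionBy(new HashPartitioner(numClasses))

// Each sub-dataset is now one partition; PartitionPruningRDD lets a job run
// over just that partition instead of re-filtering all of `big` n times.
for (i <- 0 until numClasses) {
  val sub = PartitionPruningRDD.create(byClass, _ == i).values
  sub.saveAsTextFile("hdfs://.../sub-" + i)     // placeholder output path
}
```

The trade-off: `partitionBy` still costs one full shuffle of the data, but after that each sub-dataset can be read or saved without another scan of the whole input.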


On Tue, Sep 10, 2013 at 6:24 PM, Xiang Huo <huoxiang5659@gmail.com> wrote:

> Hi,
>
> I am trying to extract several sub-datasets from one large dataset. I know one
> method is to run val small = big.filter(...) and then save this RDD as a
> text file, repeated n times, where n is the number of sub-datasets I want. But I
> wonder whether there is any way to traverse the large dataset only once. In my
> case the large dataset is more than several TB, and each record in it belongs
> to exactly one sub-dataset.
>
> Any help is appreciated.
>
> Thanks
>
> Xiang
> --
> Xiang Huo
> Department of Computer Science
> University of Illinois at Chicago(UIC)
> Chicago, Illinois
> US
> Email: huoxiang5659@gmail.com
>            or xhuo4@uic.edu
>
