spark-user mailing list archives

From Xiang Huo <huoxiang5...@gmail.com>
Subject How to split one big RDD into several small ones
Date Wed, 11 Sep 2013 01:24:19 GMT
Hi,

I am trying to extract several sub-datasets from one large dataset. One method
I know is to run val small = big.filter(...) and save the resulting RDD as a
text file, repeating this n times, where n is the number of sub-datasets I
want. But is there any way to traverse the large dataset only once? In my case
the large dataset is several TB in size, and each record in it belongs to
exactly one sub-dataset.
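One way to avoid n full passes, since each record maps to exactly one sub-dataset, is to tag every record with its sub-dataset name and write all of them out in a single pass using Hadoop's MultipleTextOutputFormat, which routes each key to its own output directory. A rough sketch (classify is a hypothetical function standing in for your classification logic, and /output/path is a placeholder):

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Hypothetical classifier: maps a record to the name of its sub-dataset.
def classify(record: String): String = ???

class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Write each record into a subdirectory named after its key.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
  // Drop the key from the output lines; only the record itself is written.
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

big
  .map(record => (classify(record), record)) // tag each record with its sub-dataset
  .saveAsHadoopFile("/output/path", classOf[String], classOf[String],
    classOf[KeyBasedOutput])
```

This traverses the large RDD only once; the trade-off is that the sub-datasets end up as separate output directories rather than separate in-memory RDDs.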

Any help is appreciated.

Thanks

Xiang
-- 
Xiang Huo
Department of Computer Science
University of Illinois at Chicago(UIC)
Chicago, Illinois
US
Email: huoxiang5659@gmail.com
           or xhuo4@uic.edu
