crunch-user mailing list archives

From Danny Morgan <unlucky...@hotmail.com>
Subject Planning Optimization for Sort
Date Tue, 13 Jan 2015 21:11:27 GMT
Hi Everyone,
I have a Crunch job that reads some data from S3, applies a simple MapFn, and then does
a total order sort.
    PCollection<String> rawdata = readTextFile("s3n://data");
    PCollection<String> data = rawdata.parallelDo(new myMapFn());
    Sort.sort(data);
I noticed that Sort from the sort library works in two phases, the first of which is called
the presort phase. When I execute this pipeline as is, the data is read and transformed three
times: once to generate the PCollections, a second time for the presort phase, and a third
time for the final sort.
The snippet below ends up reading the data from S3 only once.
    PCollection<String> rawdata = readTextFile("s3n://data");
    PCollection<String> data = rawdata.parallelDo(new myMapFn());
    data.cache();
    pipeline.run();
    Sort.sort(data);
Might this be a Crunch planner optimization opportunity?
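To illustrate the effect described above, here is a minimal plain-Java sketch. It is not Crunch code and makes no claim about how Crunch's Sort is implemented; every name in it (TwoPhaseSortSketch, lazySource, pickSplitPoints, totalOrderSort) is hypothetical. It assumes a two-phase sort in which a first pass picks split points and a second pass partitions and sorts, and it counts how often a lazily evaluated source is re-read. It models only the two sort passes, so it shows two reads rather than the three observed in the real job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class TwoPhaseSortSketch {

    /** A lazy source: every traversal re-reads the raw data and re-applies the map step. */
    public static Iterable<String> lazySource(List<String> raw, AtomicInteger reads) {
        return () -> {
            reads.incrementAndGet();                          // count one full read
            return raw.stream().map(String::trim).iterator(); // stand-in for the MapFn
        };
    }

    /** Phase 1 (presort): scan the input and choose partition boundaries. */
    public static List<String> pickSplitPoints(Iterable<String> input, int partitions) {
        List<String> sample = new ArrayList<>();
        for (String s : input) {
            sample.add(s); // a real job would sample; the sketch keeps everything
        }
        Collections.sort(sample);
        List<String> splits = new ArrayList<>();
        if (sample.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < partitions; i++) {
            splits.add(sample.get(i * sample.size() / partitions));
        }
        return splits;
    }

    /** Phase 2: route each record to its range partition, sort each one, concatenate. */
    public static List<String> totalOrderSort(Iterable<String> input, int partitions) {
        List<String> splits = pickSplitPoints(input, partitions); // first pass over input
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= splits.size(); i++) {
            buckets.add(new ArrayList<>());
        }
        for (String s : input) {                                  // second pass over input
            int b = 0;
            while (b < splits.size() && s.compareTo(splits.get(b)) >= 0) {
                b++;
            }
            buckets.get(b).add(s);
        }
        List<String> out = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Collections.sort(bucket);
            out.addAll(bucket);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = List.of(" pear", "apple ", "fig", " kiwi ");
        AtomicInteger reads = new AtomicInteger();
        Iterable<String> lazy = lazySource(raw, reads);

        // Sorting the lazy source directly re-reads it once per phase.
        System.out.println(totalOrderSort(lazy, 2) + " source reads: " + reads.get());

        // Materializing first (the analogue of cache() followed by run()) reads it once.
        reads.set(0);
        List<String> cached = new ArrayList<>();
        for (String s : lazy) {
            cached.add(s);
        }
        System.out.println(totalOrderSort(cached, 2) + " source reads: " + reads.get());
    }
}
```

In the sketch, the materialized list plays the role of the cached PCollection: both sort phases iterate over it instead of going back to the source.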
Thanks!
Danny