crunch-user mailing list archives

From Ben Juhn <benjij...@gmail.com>
Subject Processing many map-only collections in a single pipeline with Spark
Date Sat, 16 Jul 2016 01:53:16 GMT
Hello,

I have a job configured the following way:
for (String path : paths) {
    PCollection<String> col = pipeline.readTextFile(path);
    col.parallelDo(new MyDoFn(path), Writables.strings())
       .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
}
pipeline.done();
It results in one Spark job per path, and the jobs run in sequence even though there
are no dependencies between them. Is it possible to have the jobs run in parallel?
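For clarity, the behavior I'm after would look something like the sketch below. It's
untested, and it assumes (I'm not sure this holds) that several SparkPipeline instances
can run concurrently in one driver via runAsync(); sparkMaster is a placeholder for
whatever master URL is in use:

List<PipelineExecution> executions = new ArrayList<>();
for (String path : paths) {
    // Hypothetical: one pipeline per path, so each read/transform/write is independent.
    Pipeline p = new SparkPipeline(sparkMaster, "per-path-" + path);
    PCollection<String> col = p.readTextFile(path);
    col.parallelDo(new MyDoFn(path), Writables.strings())
       .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
    // runAsync() returns immediately instead of blocking like run()/done().
    executions.add(p.runAsync());
}
for (PipelineExecution exec : executions) {
    exec.waitUntilDone();  // block until every per-path job finishes
}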
Thanks,
Ben

