crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Processing many map only collections in single pipeline with spark
Date Sun, 17 Jul 2016 03:31:08 GMT
The TL;DR is that Spark doesn't really have a proper multiple-outputs model
à la Crunch/MR-- i.e., jobs aren't kicked off until you do some sort of
write, and as soon as you do a write, Spark myopically executes all of the
code that needs to happen for that write to complete. You need to be fairly
clever about sequencing your writes and doing intermediate caching to make
sure your pipeline executes efficiently.
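As a rough sketch of that sequencing-and-caching idea (the class, path, and app names below are made up for illustration): derive the shared intermediate once, cache() it, and only then issue the writes, so later writes don't recompute everything upstream.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.types.writable.Writables;

public class MultiWriteSketch {
  public static void main(String[] args) {
    // "local" and the paths below are illustrative only.
    Pipeline pipeline = new SparkPipeline("local", "multi-write-sketch");

    // One shared, potentially expensive transformation.
    PCollection<String> cleaned = pipeline.readTextFile("in")
        .parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            emitter.emit(line.trim());
          }
        }, Writables.strings())
        .cache(); // ask Crunch/Spark to persist the intermediate result

    // Each write triggers execution of everything upstream of it; the cache()
    // hint keeps the second write from recomputing the parallelDo.
    cleaned.write(To.textFile("out/copy1"));
    cleaned.write(To.textFile("out/copy2"));
    pipeline.done();
  }
}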

On the Crunch/MR side, I'm a little surprised that we're only executing the
map-only jobs one at a time-- I'm assuming you're not mucking with the
crunch.max.running.jobs parameter in some way? (see:
https://crunch.apache.org/user-guide.html#mrpipeline )
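For reference, a minimal sketch of raising that parameter on an MRPipeline (the value 10 and the driver class are illustrative):

import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class MaxRunningJobsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Cap on MapReduce jobs this pipeline may run concurrently.
    conf.setInt("crunch.max.running.jobs", 10);
    Pipeline pipeline = new MRPipeline(MaxRunningJobsSketch.class, conf);
    // ... build the map-only outputs as before, then pipeline.done();
  }
}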


J

On Sat, Jul 16, 2016 at 7:29 PM, David Ortiz <dpo5003@gmail.com> wrote:

> Hmm.  Just out of curiosity, what if you do Pipeline.read in place of
> readTextFile?
>
> On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <benjijuhn@gmail.com> wrote:
>
>> Nope, it queues up the jobs in series there too.
>>
>> On Jul 16, 2016, at 6:01 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>
>> *run in parallel
>>
>> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <dpo5003@gmail.com> wrote:
>>
>>> Just out of curiosity, if you use mrpipeline does it fun on parallel?
>>> If so, the issue may be in Spark, since I believe Crunch leaves it to Spark to
>>> handle the best method of execution.
>>>
>>> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <benjijuhn@gmail.com> wrote:
>>>
>>>> Hey David,
>>>>
>>>> I have 100 active executors; each job typically only uses a few.  It’s
>>>> running on YARN.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jul 16, 2016, at 12:53 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>>>
>>>> What are the cluster resources available vs what a single map uses?
>>>>
>>>> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <benjijuhn@gmail.com> wrote:
>>>>
>>>>> I enabled FAIR scheduling hoping that would help, but only one job is
>>>>> showing up at a time.
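A likely reason FAIR scheduling alone doesn't change this: Spark's FAIR scheduler only interleaves jobs that are actually submitted concurrently, typically from separate driver threads, and it has to be enabled explicitly. A minimal sketch of the relevant setting, assuming a plain SparkConf:

import org.apache.spark.SparkConf;

public class FairSchedulerSketch {
  public static void main(String[] args) {
    // FAIR mode lets concurrently submitted jobs share executors; it does not
    // by itself make sequentially submitted jobs run in parallel.
    SparkConf conf = new SparkConf()
        .setAppName("fair-scheduler-sketch")
        .set("spark.scheduler.mode", "FAIR");
    // A SparkContext built from this conf would use FAIR scheduling across
    // jobs submitted from different threads.
  }
}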
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
>>>>>
>>>>> Each input is of a different format, and the DoFn implementation
>>>>> handles them depending on instantiation parameters.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <sjdurfey@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Instead of using readTextFile on the pipeline, try the read method with a
>>>>> TextFileSource, which can accept a collection of paths.
>>>>>
>>>>>
>>>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
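A minimal sketch of that suggestion (the paths are made up, it assumes the List<Path> constructor on TextFileSource, and it assumes the per-path DoFn parameterization can be handled some other way):

import java.util.Arrays;
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.fs.Path;

public class MultiPathReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new SparkPipeline("local", "multi-path-read-sketch");
    List<Path> inputs = Arrays.asList(new Path("in/a"), new Path("in/b"), new Path("in/c"));
    // One source over all paths yields a single PCollection, and therefore a
    // single logical read instead of one job per path.
    PCollection<String> lines =
        pipeline.read(new TextFileSource<String>(inputs, Writables.strings()));
    // ... parallelDo/write on the combined collection, then pipeline.done();
  }
}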
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <benjijuhn@gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>>
>>>>>> I have a job configured the following way:
>>>>>>
>>>>>> for (String path : paths) {
>>>>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>>>>     col.parallelDo(new MyDoFn(path), Writables.strings()).write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>>>> }
>>>>>> pipeline.done();
>>>>>>
>>>>>> It results in one Spark job for each path, and the jobs run in sequence even though there are no dependencies.  Is it possible to have the jobs run in parallel?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
