crunch-user mailing list archives

From David Ortiz <dpo5...@gmail.com>
Subject Re: Processing many map only collections in single pipeline with spark
Date Sun, 17 Jul 2016 01:01:05 GMT
*run in parallel

On Sat, Jul 16, 2016, 5:36 PM David Ortiz <dpo5003@gmail.com> wrote:

> Just out of curiosity, if you use MRPipeline does it run in parallel?  If
> so, the issue may be in Spark, since I believe Crunch leaves it to Spark to
> handle the best method of execution.
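>
> A rough sketch of that check, assuming the job currently builds a SparkPipeline
> (the driver class name below is just a placeholder):
>
> // Hypothetical diagnostic: run the same flow on MRPipeline and see whether
> // the independent map-only jobs execute concurrently under MapReduce.
> Pipeline pipeline = new MRPipeline(MyJobDriver.class, new Configuration());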
>
> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <benjijuhn@gmail.com> wrote:
>
>> Hey David,
>>
>> I have 100 active executors, but each job typically only uses a few.  It's
>> running on YARN.
>>
>> Thanks,
>> Ben
>>
>> On Jul 16, 2016, at 12:53 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>
>> What are the cluster resources available vs what a single map uses?
>>
>> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <benjijuhn@gmail.com> wrote:
>>
>>> I enabled FAIR scheduling hoping that would help, but only one job is
>>> showing up at a time.
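>>>
>>> For reference, a sketch of how FAIR mode is typically enabled on the Spark
>>> side (the property name is standard Spark config; the app name here is just
>>> a placeholder):
>>>
>>> // Note (assumption): FAIR scheduling only interleaves jobs that are
>>> // submitted concurrently, e.g. from separate driver threads; it does not
>>> // by itself parallelize jobs that are issued one after another.
>>> SparkConf conf = new SparkConf()
>>>     .setAppName("crunch-map-only")
>>>     .set("spark.scheduler.mode", "FAIR");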
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
>>>
>>> Each input is of a different format, and the DoFn implementation handles
>>> them depending on instantiation parameters.
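>>>
>>> Roughly what that looks like (a hypothetical sketch; the real format
>>> handling isn't shown here):
>>>
>>> // Hypothetical DoFn parameterized at construction time; the path it is
>>> // built with decides which format-specific handling applies in process().
>>> public class MyDoFn extends DoFn<String, String> {
>>>   private final String path;
>>>
>>>   public MyDoFn(String path) {
>>>     this.path = path;
>>>   }
>>>
>>>   @Override
>>>   public void process(String line, Emitter<String> emitter) {
>>>     // placeholder: pick a parser based on 'path' and emit the parsed result
>>>     emitter.emit(path + "\t" + line);
>>>   }
>>> }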
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:
>>>
>>> Instead of using readTextFile on the pipeline, try using the read method
>>> with a TextFileSource, which can accept a collection of paths.
>>>
>>>
>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
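>>>
>>> A minimal sketch of that suggestion (assuming the existing paths list; note
>>> that the per-path MyDoFn parameter would then need another way to tell the
>>> inputs apart):
>>>
>>> // Hypothetical: one TextFileSource over all paths instead of one
>>> // readTextFile call (and one job) per path.
>>> List<Path> inputs = new ArrayList<Path>();
>>> for (String path : paths) {
>>>   inputs.add(new Path(path));
>>> }
>>> PCollection<String> all =
>>>     pipeline.read(new TextFileSource<String>(inputs, Writables.strings()));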
>>>
>>>
>>>
>>>
>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <benjijuhn@gmail.com>
>>> wrote:
>>>
>>> Hello,
>>>>
>>>> I have a job configured the following way:
>>>>
>>>> for (String path : paths) {
>>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>>>         .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>> }
>>>> pipeline.done();
>>>>
>>>> It results in one spark job for each path, and the jobs run in sequence
>>>> even though there are no dependencies.  Is it possible to have the jobs
>>>> run in parallel?
>>>>
>>>> Thanks,
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>
>>>
>>
