flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Job Manager Configuration
Date Wed, 08 Nov 2017 15:17:23 GMT
Quick question Regina: Which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <till.rohrmann@gmail.com>
wrote:

> Hi Regina,
>
> the user code is uploaded once to the `JobManager` and then downloaded
> once by each `TaskManager`, when it receives the command to execute its
> first task of your job.
>
> As Chesnay said there is no fundamental limitation to the size of the
> Flink job. However, it might be the case that you have configured your job
> sub-optimally. You said that you have 300 parallel flows. Depending on
> whether you've defined separate slot sharing groups for them or not, it
> might be the case that parallel subtasks of all 300 flows share the
> same slot (if you haven't changed the slot sharing group). Depending on
> what you calculate, this can be inefficient because the individual tasks
> don't get much computation time. Moreover, all tasks will allocate some
> objects on the heap, which can lead to more GC. Therefore, it might make
> sense to group some of the flows together and run them in batches, each
> starting after the previous batch has completed. But this is hard to say
> without knowing the details of your job and getting a glimpse at the
> JobManager logs.
>
> Concerning the exception you're seeing, it would also be helpful to see
> the logs of the client and the JobManager. Actually, the scheduling of the
> job is independent of the response. Only the creation of the ExecutionGraph
> and making the JobGraph highly available in case of an HA setup are
> executed before the JobManager acknowledges the job submission. The
> SubmissionTimeoutException is thrown only if this acknowledgement is not
> received in time on the client side. Therefore, I assume that somehow
> the JobManager is too busy or is kept from sending the acknowledgement.
>
> Cheers,
> Till
>
>
>
> On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <Regina.Chan@gs.com> wrote:
>
>> Does it copy per TaskManager or per operator? I only gave it 10
>> TaskManagers with 2 slots. I’m perfectly fine with it queuing up and
>> running when it has the resources to.
>>
>> *From:* Chesnay Schepler [mailto:chesnay@apache.org]
>> *Sent:* Wednesday, November 01, 2017 7:09 AM
>> *To:* user@flink.apache.org
>> *Subject:* Re: Job Manager Configuration
>>
>>
>>
>> AFAIK there is no theoretical limit on the size of the plan, it just
>> depends on the available resources.
>>
>>
>> The job submission times out since it takes too long to deploy all the
>> operators that the job defines. With 300 flows, each with 6 operators,
>> you're looking at potentially (1800 * parallelism) tasks that have to be
>> deployed. For each task Flink copies the user code of *all* flows to the
>> executing TaskManager, which the network may just not be able to handle
>> in time.
>>
>> I suggest splitting your job into smaller batches or even running each
>> of them independently.
>>
>> On 31.10.2017 16:25, Chan, Regina wrote:
>>
>> Asking an additional question, what is the largest plan that the
>> JobManager can handle? Is there a limit? My flows don’t need to run in
>> parallel and can run independently. I wanted them to run in one single job
>> because it’s part of one logical commit on my side.
>>
>>
>>
>> Thanks,
>>
>> Regina
>>
>>
>>
>> *From:* Chan, Regina [Tech]
>> *Sent:* Monday, October 30, 2017 3:22 PM
>> *To:* 'user@flink.apache.org'
>> *Subject:* Job Manager Configuration
>>
>>
>>
>> Flink Users,
>>
>>
>>
>> I have about 300 parallel flows in one job, each with 2 inputs, 3
>> operators, and 1 sink, which makes for a large job. I keep getting the
>> timeout exception below, even though I’ve already set a 30-minute timeout
>> and a 6GB heap on the JobManager. Is there a heuristic to better
>> configure the JobManager?
>>
>>
>>
>> Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException:
>> Job submission to the JobManager timed out. You may increase
>> 'akka.client.timeout' in case the JobManager needs more time to configure
>> and confirm the job submission.
>>
>>
>>
>> *Regina Chan*
>>
>> *Goldman Sachs* *–* Enterprise Platforms, Data Architecture
>>
>> *30 Hudson Street, 37th floor | Jersey City, NY 07302
>> <https://maps.google.com/?q=30+Hudson+Street,+37th+floor+%7C+Jersey+City,+NY+07302&entry=gmail&source=g>*
>> (212) 902-5697
>>
>>
>>
>>
>>
>
>
