flink-user mailing list archives

From Vijay Balakrishnan <bvija...@gmail.com>
Subject Re: Flink Dashboard UI Tasks hard limit
Date Thu, 28 May 2020 01:30:40 GMT
Hi Xintong,
Looks like the issue is not fully resolved :( Attaching 2 screenshots of
the memory consumption of 1 of the TaskManagers.

To increase the direct (off-heap) memory used, do I change this:
 taskmanager.memory.task.off-heap.size: 5gb

I had increased taskmanager.network.memory.max to 24gb,
which seems excessive.
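For reference, the memory-related keys discussed in this thread would sit together in flink-conf.yaml roughly like this (the values are the ones quoted in this thread, not recommendations):

```yaml
# Direct (off-heap) memory reserved for user code per TaskManager
taskmanager.memory.task.off-heap.size: 5gb

# Network buffer pool: a fraction of total memory, clamped to [min, max]
taskmanager.network.memory.fraction: 0.15
taskmanager.network.memory.min: 500mb
taskmanager.network.memory.max: 24gb
```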

One of the errors I saw in the Flink logs:

java.io.IOException: Insufficient number of network buffers: required 1,
but only 0 available. The total number of network buffers is currently set
to 85922 of 32768 bytes each. You can increase this number by setting the
configuration keys 'taskmanager.network.memory.fraction',
'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
at
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:281)
at
org.apache.flink.runtime.io.network.partition.ResultPartitionFactory.lambda$createBufferPoolFactory$0(ResultPartitionFactory.java:191)
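For what it's worth, the buffer count in that error can be cross-checked against the configured network memory. This is only a rough sanity check assuming the default 32 KiB network segment size (the "32768 bytes each" in the message), not Flink's exact allocation logic:

```java
public class NetworkBufferCheck {
    public static void main(String[] args) {
        // Values taken from the error message above
        long totalBuffers = 85_922L;
        long segmentSize = 32 * 1024L; // 32768 bytes, the default network segment size

        // Total network memory backing the global buffer pool
        long networkMemoryBytes = totalBuffers * segmentSize;
        System.out.printf("Network memory backing %d buffers: %.2f GiB%n",
                totalBuffers, networkMemoryBytes / Math.pow(1024, 3));
    }
}
```

With 85922 buffers of 32 KiB each, the global pool holds roughly 2.6 GiB of network memory. The "required 1, but only 0 available" failure means a local buffer pool asked for more buffers than remained in the global pool, so the levers are the taskmanager.network.memory.* keys named in the error (or a lower parallelism, which reduces the number of channels needing buffers).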

TIA,


On Wed, May 27, 2020 at 9:06 AM Vijay Balakrishnan <bvijaykr@gmail.com>
wrote:

> Thanks so much, Xintong for guiding me through this. I looked at the Flink
> logs to see the errors.
> I had to change taskmanager.network.memory.max: 4gb and akka.ask.timeout:
> 240s to increase the number of tasks.
> Now I am able to increase the number of tasks, aka task vertices.
>
> taskmanager.network.memory.fraction: 0.15
> taskmanager.network.memory.max: 4gb
> taskmanager.network.memory.min: 500mb
> akka.ask.timeout: 240s
>
> On Tue, May 26, 2020 at 8:42 PM Xintong Song <tonysong820@gmail.com>
> wrote:
>
>> Could you also explain how you set the parallelism when getting this
>> execution plan?
>> I'm asking because this json file itself only shows the resulting
>> execution plan. It is not clear to me what is not working as expected
>> in your case. E.g., you set the parallelism for an operator to 10 but
>> the execution plan only shows 5.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Wed, May 27, 2020 at 3:16 AM Vijay Balakrishnan <bvijaykr@gmail.com>
>> wrote:
>>
>>> Hi Xintong,
>>> Thanks for the excellent clarification for tasks.
>>>
>>> I attached a sample screenshot above, but it didn't reflect the slots
>>> used and the task limit I was running into in that pic.
>>>
>>> I am attaching my execution plan here. Please let me know how I can
>>> increase the number of tasks, aka parallelism. As I increase the
>>> parallelism, I run into this bottleneck with the tasks.
>>>
>>> BTW - The https://flink.apache.org/visualizer/ is a great start to see
>>> this.
>>> TIA,
>>>
>>> On Sun, May 24, 2020 at 7:52 PM Xintong Song <tonysong820@gmail.com>
>>> wrote:
>>>
>>>> Increasing network memory buffers (fraction, min, max) seems to
>>>>> increase tasks slightly.
>>>>
>>>> That's weird. I don't think the number of network memory buffers has
>>>> anything to do with the task count.
>>>>
>>>> Let me try to clarify a few things.
>>>>
>>>> Please be aware that, how many tasks a Flink job has, and how many
>>>> slots a Flink cluster has, are two different things.
>>>> - The number of tasks is decided by your job's parallelism and
>>>> topology. E.g., if your job graph has 3 vertices A, B and C, with
>>>> parallelism 2, 3 and 4 respectively, then you would have 9 (2+3+4)
>>>> tasks in total.
>>>> - The number of slots is decided by the number of TMs and slots-per-TM.
>>>> - For streaming jobs, you have to make sure the number of slots is
>>>> enough for executing all your tasks. The number of slots needed for
>>>> executing your job is by default the max parallelism of your job
>>>> graph vertices. Take the above example: you would need 4 slots,
>>>> because that is the max among all the vertices' parallelisms (2, 3, 4).
>>>>
>>>> In your case, the screenshot shows that your job has 9621 tasks in
>>>> total (not around 18000; the dark box shows total tasks while the
>>>> green box shows running tasks), and 600 slots are in use (658 - 58),
>>>> suggesting that the max parallelism of your job graph vertices is 600.
>>>>
>>>> If you want to increase the number of tasks, you should increase your
>>>> job parallelism. There are several ways to do that.
>>>>
>>>>    - In your job codes (assuming you are using DataStream API)
>>>>       - Use `StreamExecutionEnvironment#setParallelism()` to set
>>>>       parallelism for all operators.
>>>>       - Use `SingleOutputStreamOperator#setParallelism()` to set
>>>>       parallelism for a specific operator. (Only supported for
>>>>       subclasses of `SingleOutputStreamOperator`.)
>>>>    - When submitting your job, use `-p <parallelism>` as an argument
>>>>    for the `flink run` command, to set parallelism for all operators.
>>>>    - Set `parallelism.default` in your `flink-conf.yaml`, to set a
>>>>    default parallelism for your jobs. This will be used for jobs
>>>>    that have not set parallelism via any of the above methods.
>>>>
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>>
>>>>
>>>> On Sat, May 23, 2020 at 1:11 AM Vijay Balakrishnan <bvijaykr@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Xintong,
>>>>> Thx for your reply.  Increasing network memory buffers (fraction,
>>>>> min, max) seems to increase tasks slightly.
>>>>>
>>>>> Streaming job
>>>>> Standalone
>>>>>
>>>>> Vijay
>>>>>
>>>>> On Fri, May 22, 2020 at 2:49 AM Xintong Song <tonysong820@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> I don't think your problem is related to the number of open files.
>>>>>> The parallelism of your job is decided before Flink actually tries
>>>>>> to open the files. And if the OS limit on open files were reached,
>>>>>> you would see a job execution failure, instead of a successful
>>>>>> execution with a lower parallelism.
>>>>>>
>>>>>> Could you share some more information about your use case?
>>>>>>
>>>>>>    - What kind of job are you executing? Is it a streaming or batch
>>>>>>    processing job?
>>>>>>    - Which Flink deployment do you use? Standalone? Yarn?
>>>>>>    - It would be helpful if you can share the Flink logs.
>>>>>>
>>>>>>
>>>>>> Thank you~
>>>>>>
>>>>>> Xintong Song
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 20, 2020 at 11:50 PM Vijay Balakrishnan <
>>>>>> bvijaykr@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I have increased the number of slots available, but the job is
>>>>>>> not using all the slots and runs into this approximate 18000-task
>>>>>>> limit. Looking into the source code, it seems to be opening a file -
>>>>>>> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/io/FileOutputFormat.java#L203
>>>>>>> So, do I have to tune the ulimit or something similar at the
>>>>>>> Ubuntu O/S level to increase the number of tasks available? What I
>>>>>>> am confused about is that the ulimit is per machine, but the
>>>>>>> ExecutionGraph is across many machines? Please pardon my ignorance
>>>>>>> here. Does the number of tasks equate to the number of open files?
>>>>>>> I am using 15 slots per TaskManager on AWS m5.4xlarge, which has
>>>>>>> 16 vCPUs.
>>>>>>>
>>>>>>> TIA.
>>>>>>>
>>>>>>> On Tue, May 19, 2020 at 3:22 PM Vijay Balakrishnan <
>>>>>>> bvijaykr@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Flink Dashboard UI seems to show a hard limit of around 18000
>>>>>>>> for the Tasks column on an Ubuntu Linux box.
>>>>>>>> I kept increasing the number of slots per task manager to 15,
>>>>>>>> and the number of slots increased to 705, but the tasks stayed
>>>>>>>> at around 18000. Below 18000 tasks, the Flink job is able to
>>>>>>>> start up.
>>>>>>>> Even though I increased the number of slots, it still works when
>>>>>>>> only 312 slots are being used.
>>>>>>>>
>>>>>>>> taskmanager.numberOfTaskSlots: 15
>>>>>>>>
>>>>>>>> What knob can I tune to increase the number of tasks?
>>>>>>>>
>>>>>>>> Pls find attached the Flink Dashboard UI.
>>>>>>>>
>>>>>>>> TIA,
>>>>>>>>
>>>>>>>>
