flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2
Date Tue, 13 Nov 2018 14:35:17 GMT
Good to hear Cliff.

You're right that it's not a nice user experience. The problem with
queryable state is that one would need to take a look at the actual user
job to decide whether the user uses queryable state or not. But then it's
already too late for starting the respective infrastructure needed for
querying the state. You're right, though, that we should at least take a
random port by default. I've created a corresponding issue for this:
https://issues.apache.org/jira/browse/FLINK-10866.

Cheers,
Till
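
The "random port by default" idea Till mentions can be illustrated outside Flink. The following is a minimal sketch of the general technique, not Flink's actual implementation: binding to port 0 lets the operating system choose an unused port, whereas a fixed port fails as soon as another process on the same host already holds it. The helper names `bind_to_free_port` and `try_fixed_port` are hypothetical and exist only for this example.

```python
import socket


def bind_to_free_port(host="127.0.0.1"):
    """Bind a TCP server socket to a port chosen by the OS (hypothetical helper)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the operating system to pick any currently unused port.
    server.bind((host, 0))
    server.listen()
    return server, server.getsockname()[1]


def try_fixed_port(port, host="127.0.0.1"):
    """Return True if a fixed port can be bound, False if it is already taken."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError:
        # Typically "address already in use": the fixed-port failure mode.
        return False
    finally:
        probe.close()


# Two servers on the same host both start, because neither insists on a fixed port.
first, first_port = bind_to_free_port()
second, second_port = bind_to_free_port()

# A third process demanding the port that `first` already holds cannot bind it.
fixed_port_available = try_fixed_port(first_port)

first.close()
second.close()
```

On a shared host, the fixed-port case corresponds to what Cliff describes: a component that insists on a specific port cannot start once that port is taken.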

On Mon, Nov 12, 2018 at 11:16 PM Cliff Resnick <cresny@gmail.com> wrote:

> Hi Till,
>
> Yes, it turns out the problem was
> having flink-queryable-state-runtime_2.11-1.6.2.jar in flink/lib. I guess
> Queryable State bootstraps itself and, in my situation, it brought the task
> manager down when it found no available ports. What's a little troubling is
> that I had not configured Queryable State at all, so I would not expect it
> to get in the way. I haven't looked further into it but I think that if
> Queryable State wants to enable itself then it should at worst take an
> unused port by default, especially since many folks will be running in
> shared environments like YARN.
>
> But anyway, thanks for that! I'm now up with 1.6.2.
>
> Cliff
>
> On Mon, Nov 12, 2018 at 6:04 AM Till Rohrmann <trohrmann@apache.org>
> wrote:
>
>> Hi Cliff,
>>
>> the TaskManagers fail to start with exit code 31, which indicates an
>> initialization error on startup. If you check the TaskManager logs via
>> `yarn logs -applicationId <APP_ID>` you should see why the TMs
>> don't start up.
>>
>> Cheers,
>> Till
>>
>> On Fri, Nov 9, 2018 at 8:32 PM Cliff Resnick <cresny@gmail.com> wrote:
>>
>>> Hi Till,
>>>
>>> Here are Job Manager logs, same job in both 1.6.0 and 1.6.2 at DEBUG
>>> level. I saw several errors in 1.6.2, hope it's informative!
>>>
>>> Cliff
>>>
>>> On Fri, Nov 9, 2018 at 8:34 AM Till Rohrmann <trohrmann@apache.org>
>>> wrote:
>>>
>>>> Hi Cliff,
>>>>
>>>> this does not sound right. Could you share the logs of the Yarn cluster
>>>> entrypoint with the community for further debugging? Ideally on DEBUG
>>>> level. The Yarn logs would also be helpful to fully understand the problem.
>>>> Thanks a lot!
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, Nov 8, 2018 at 9:59 PM Cliff Resnick <cresny@gmail.com> wrote:
>>>>
>>>>> I'm running a YARN cluster of 8 * 4 core instances = 32 cores, with a
>>>>> configuration of 3 slots per TM. The cluster is dedicated to a single job
>>>>> that runs at full capacity in "FLIP6" mode. So in this cluster, the
>>>>> parallelism is 21 (7 TMs * 3, one container dedicated for Job Manager).
>>>>>
>>>>> When I run the job in 1.6.0, seven Task Managers are spun up as
>>>>> expected. But if I run with 1.6.2 only four Task Managers spin up and the
>>>>> job hangs waiting for more resources.
>>>>>
>>>>> Our Flink distribution is set up by script after building from source.
>>>>> So aside from flink jars, both 1.6.0 and 1.6.2 directories are identical.
>>>>> The job is the same, restarting from savepoint. The problem is repeatable.
>>>>>
>>>>> Has something changed in 1.6.2, and if so can it be remedied with a
>>>>> config change?
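
The slot arithmetic in Cliff's original report can be checked with a short sketch. It assumes one container per instance, which the thread implies but does not state outright; the function and variable names are illustrative only.

```python
def available_slots(task_managers, slots_per_tm):
    """Total task slots offered by a number of TaskManagers."""
    return task_managers * slots_per_tm


instances = 8                 # YARN instances in the cluster
job_manager_containers = 1    # one container is dedicated to the Job Manager
slots_per_tm = 3              # configured slots per TaskManager

# Assumption: one TaskManager per remaining instance.
expected_tms = instances - job_manager_containers
required_parallelism = available_slots(expected_tms, slots_per_tm)

# With 1.6.2 only four TaskManagers came up.
slots_on_1_6_2 = available_slots(4, slots_per_tm)
missing_slots = required_parallelism - slots_on_1_6_2

print(f"required={required_parallelism} available={slots_on_1_6_2} missing={missing_slots}")
```

With only four TaskManagers the cluster offers 12 of the 21 slots the job needs, which is consistent with the job hanging while it waits for more resources.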
