flink-user mailing list archives

From Gary Yao <g...@da-platform.com>
Subject Re: Flink 1.7.1 job is stuck in running state
Date Fri, 18 Jan 2019 14:49:37 GMT
Hi Piotr,

Ideally on DEBUG level.

Best,
Gary

On Fri, Jan 18, 2019 at 3:41 PM Piotr Szczepanek <piotr.szczepanek@gmail.com>
wrote:

> Hey Gary,
> thanks for your reply.
> Previously we were using Flink version 1.5.2.
> With both versions, Flink is deployed on YARN.
>
> Regarding the logs, would you like log entries with DEBUG enabled, or
> would INFO be enough?
>
> Thanks,
> Piotr
>
> On Fri, Jan 18, 2019 at 3:14 PM Gary Yao <gary@da-platform.com> wrote:
>
>> Hi Piotr,
>>
>> What was the version you were using before 1.7.1?
>> How do you deploy your cluster, e.g., YARN, standalone?
>> Can you attach full TM and JM logs?
>>
>> Best,
>> Gary
>>
>> On Fri, Jan 18, 2019 at 3:05 PM Piotr Szczepanek <
>> piotr.szczepanek@gmail.com> wrote:
>>
>>> Hello,
>>> we have a scenario in which Data Processing jobs generate export
>>> files on demand. Our first approach was to use ClusterClient, but recently
>>> we switched to the REST API for job submission. In the meantime we upgraded
>>> to Flink 1.7.1, and that started to cause problems.
>>> Some of our jobs are stuck and do not process any data. The Task Managers
>>> log that the Chain is switching to RUNNING, and then nothing happens.
>>> In the TMs' stdout logs we can see that for some reason the log is cut off, e.g.:
>>>
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
>>> initialized will read a total of 615 records.
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
>>> next block
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
>>> in 63 ms. row count = 615
>>> Jan 10, 2019 4:28:33 PM WARNING:
>>> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
>>> due to context is not a instance of TaskInputOutputContext, but is
>>> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
>>> initialized will read a total of 140 records.
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
>>> next block
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
>>> in 2 ms. row count = 140
>>> Jan 10, 2019 4:28:33 PM WARNING:
>>> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
>>> due to context is not a instance of TaskInputOutputContext, but is
>>> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>>> Jan 10, 2019 4:28:33 PM INFO: or
>>>
>>> As you can see, the last line is cut off in the middle, and nothing happens
>>> afterwards.
>>> None of the counters (records/bytes sent/read) increase.
>>> We enabled DEBUG logging on both the TMs and the JM, but the only thing they
>>> show is heartbeats being sent between each other.
>>> Do you have any idea what the problem could be, and how we could deal with
>>> it or at least try to investigate? Is there any timeout/config that we
>>> could try to enable?
>>>
>>
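
The symptom described in the original report (tasks in RUNNING whose records/bytes counters never increase) could be checked by comparing two snapshots of the job details returned by the REST API. The following is only a minimal sketch under stated assumptions: it assumes `GET /jobs/<jobid>` returns a `vertices` list whose entries carry `id`, `name`, `status`, and a `metrics` object with the keys listed in `COUNTERS`. The names `fetch_job`, `stalled_vertices`, `check_once`, and the `base_url` argument are hypothetical and are not taken from the thread.

```python
import json
import time
import urllib.request

# Assumed counter keys in each vertex's "metrics" object.
COUNTERS = ("read-bytes", "write-bytes", "read-records", "write-records")


def fetch_job(base_url, job_id):
    """Fetch the job details JSON from <base_url>/jobs/<job_id>."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def stalled_vertices(before, after):
    """Return names of vertices that are RUNNING in `after` but whose
    counters are identical to those in the earlier snapshot `before`."""
    previous = {v["id"]: v.get("metrics", {}) for v in before.get("vertices", [])}
    stalled = []
    for vertex in after.get("vertices", []):
        if vertex.get("status") != "RUNNING":
            continue  # only RUNNING vertices can be "stuck in RUNNING"
        old = previous.get(vertex["id"])
        if old is None:
            continue  # vertex not present in the earlier snapshot
        new = vertex.get("metrics", {})
        if all(new.get(c, 0) == old.get(c, 0) for c in COUNTERS):
            stalled.append(vertex["name"])
    return stalled


def check_once(base_url, job_id, interval_s=60):
    """Take two snapshots interval_s seconds apart and report stalled vertices."""
    before = fetch_job(base_url, job_id)
    time.sleep(interval_s)
    after = fetch_job(base_url, job_id)
    return stalled_vertices(before, after)
```

A vertex reported by `stalled_vertices` is only a hint: a source that is legitimately idle would look the same, so the result would still need to be read together with the TM and JM logs.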
