flink-user mailing list archives

From Vasiliki Kalavri <vasilikikala...@gmail.com>
Subject Re: Job hangs
Date Wed, 27 Apr 2016 08:57:04 GMT
Hi Timur,

I've previously seen large batch jobs hang because of join deadlocks. We
should have fixed those problems, but we might have missed some corner
case. Did you check whether there was any CPU activity while the job was hung?
Can you try running htop on the taskmanager machines and see if they're
idle?
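
If htop is not installed there, a rough equivalent is to sample /proc/stat
twice and compare the counters. The following is only a sketch, assuming a
Linux machine; the function names and the 5-second interval are illustrative
choices and have nothing to do with Flink itself. It also reports iowait,
since a job that is spilling heavily could look idle on CPU while still
waiting on disk.

```python
import time


def read_cpu_times(path="/proc/stat"):
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    # fields[0] is the label "cpu"; the remaining values are jiffies in the
    # order: user nice system idle iowait irq softirq steal guest guest_nice
    return [int(x) for x in fields[1:]]


def cpu_fractions(before, after):
    """Return (busy, iowait) fractions of CPU time between two samples.

    Only the first eight counters are summed, because guest time is
    already included in user and nice.
    """
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0, 0.0
    idle = deltas[3]
    iowait = deltas[4] if len(deltas) > 4 else 0
    busy = 1.0 - (idle + iowait) / total
    return busy, iowait / total


if __name__ == "__main__":
    first = read_cpu_times()
    time.sleep(5)  # sampling interval, chosen arbitrarily
    second = read_cpu_times()
    busy, iowait = cpu_fractions(first, second)
    print(f"busy: {busy:.1%}  iowait: {iowait:.1%}")
```

A busy value near zero together with a noticeable iowait value would point
towards spilling rather than a deadlock, but treat that as a hint only.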

Cheers,
-Vasia.

On 27 April 2016 at 02:48, Timur Fayruzov <timur.fairuzov@gmail.com> wrote:

> Robert, Ufuk, logs, execution plan and a screenshot of the console are in
> the archive:
> https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0
>
> Note that when I looked at the backpressure view, I saw back pressure
> 'high' on the following paths:
>
> Input->code_line:123,124->map->join
> Input->code_line:134,135->map->join
> Input->code_line:121->map->join
>
> Unfortunately, I was not able to take thread dumps or heap dumps (none of
> kill -3, jstack, or jmap worked; some Amazon AMI problem, I assume).
>
> Hope that helps.
>
> Please let me know if I can assist you in any way. Otherwise, I probably
> will not be actively looking at this problem.
>
> Thanks,
> Timur
>
>
> On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <uce@apache.org> wrote:
>
>> Can you please further provide the execution plan via
>>
>> env.getExecutionPlan()
>>
>>
>>
>> On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
>> <timur.fairuzov@gmail.com> wrote:
>> > Hello Robert,
>> >
>> > I observed progress for 2 hours (meaning the numbers changed on the
>> > dashboard), and then I waited for 2 hours more. I'm sure it had to spill
>> > at some point, but I figured 2h is enough time.
>> >
>> > Thanks,
>> > Timur
>> >
>> > On Apr 26, 2016 1:35 AM, "Robert Metzger" <rmetzger@apache.org> wrote:
>> >>
>> >> Hi Timur,
>> >>
>> >> thank you for sharing the source code of your job. That is helpful!
>> >> It's a large pipeline with 7 joins and 2 co-groups. Maybe your job is
>> >> much more IO heavy with the larger input data because all the joins
>> >> start spilling?
>> >> Our monitoring, in particular for batch jobs, is really not very
>> >> advanced. If we had some monitoring showing the spill status, we would
>> >> maybe see that the job is still running.
>> >>
>> >> How long did you wait until you declared the job hanging?
>> >>
>> >> Regards,
>> >> Robert
>> >>
>> >>
>> >> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <uce@apache.org> wrote:
>> >>>
>> >>> No.
>> >>>
>> >>> If you run on YARN, the YARN logs are the relevant ones for the
>> >>> JobManager and TaskManager. The client log submitting the job should
>> >>> be found in /log.
>> >>>
>> >>> – Ufuk
>> >>>
>> >>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
>> >>> <timur.fairuzov@gmail.com> wrote:
>> >>> > I will do it tomorrow, my time. Logs don't show anything unusual.
>> >>> > Are there any logs besides what's in flink/log and the YARN
>> >>> > container logs?
>> >>> >
>> >>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <uce@apache.org> wrote:
>> >>> >
>> >>> > Hey Timur,
>> >>> >
>> >>> > is it possible to connect to the VMs and get stack traces of the
>> >>> > Flink processes as well?
>> >>> >
>> >>> > We can first have a look at the logs, but the stack traces will be
>> >>> > helpful if we can't figure out what the issue is.
>> >>> >
>> >>> > – Ufuk
>> >>> >
>> >>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <trohrmann@apache.org>
>> >>> > wrote:
>> >>> >> Could you share the logs with us, Timur? That would be very helpful.
>> >>> >>
>> >>> >> Cheers,
>> >>> >> Till
>> >>> >>
>> >>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <timur.fairuzov@gmail.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Hello,
>> >>> >>>
>> >>> >>> Now I'm at the stage where my job seems to completely hang. Source
>> >>> >>> code is attached (it won't compile, but I think it gives a very good
>> >>> >>> idea of what happens). Unfortunately I can't provide the datasets.
>> >>> >>> Most of them are about 100-500MM records. I try to process them on
>> >>> >>> an EMR cluster with 40 tasks and 6GB memory for each.
>> >>> >>>
>> >>> >>> It was working for smaller input sizes. Any idea on what I can do
>> >>> >>> differently is appreciated.
>> >>> >>>
>> >>> >>> Thanks,
>> >>> >>> Timur
>> >>
>> >>
>> >
>>
>
>
