flink-user mailing list archives

From Maximilian Bode <maximilian.b...@tngtech.com>
Subject Re: Jobmanager HA with Rolling Sink in HDFS
Date Mon, 07 Mar 2016 13:23:25 GMT
Hi Aljoscha,

did you by any chance get around to looking at this problem again? It seems to us that by the
time restoreState() is called, the files are already in their final state and there are additional
.valid-length files (hence there are neither pending nor in-progress files). Furthermore,
one of the part files (11 in the example below) is missing completely. We are able to reproduce
this behavior by killing a task manager. Can you make sense of that?
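
To make the failure mode concrete, here is a small sketch of the check that the exception message below seems to imply (this is only a reconstruction for discussion, not the actual RollingSink source): on restore, the sink apparently expects its recorded part file to exist either with an .in-progress or a .pending suffix, and fails when the file has already been renamed to its final name.

```java
import java.util.Set;

public class RestoreCheckSketch {
    // Hypothetical reconstruction of the check implied by the exception text
    // ("was neither moved to pending nor is still in progress"), NOT the
    // actual org.apache.flink.streaming.connectors.fs.RollingSink code.
    static String restoreDecision(Set<String> files, String part) {
        if (files.contains("_" + part + ".in-progress")) {
            return "recover in-progress file";
        }
        if (files.contains("_" + part + ".pending")) {
            return "move pending file to final";
        }
        // File was already finalized (final name + .valid-length marker),
        // which is exactly the state we observe at restore time.
        return "fail";
    }

    public static void main(String[] args) {
        // State we see in HDFS when restoreState() runs:
        Set<String> observed = Set.of("part-5-0", "_part-5-0.valid-length");
        System.out.println(restoreDecision(observed, "part-5-0")); // prints "fail"
    }
}
```

If that reading is right, the question is why the files are finalized before the restore happens at all.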

Cheers,
 Max
—
Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082

> On 03.03.2016 at 15:04, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
> 
> Hi Aljoscha,
> 
> thank you for the fast answer. The files in HDFS change as follows:
> 
> -before task manager is killed:
> [user@host user]$ hdfs dfs -ls -R /hdfs/dir/outbound
> -rw-r--r--   2 user hadoop    2435461 2016-03-03 14:51 /hdfs/dir/outbound/_part-0-0.in-progress
> -rw-r--r--   2 user hadoop    2404604 2016-03-03 14:51 /hdfs/dir/outbound/_part-1-0.in-progress
> -rw-r--r--   2 user hadoop    2453796 2016-03-03 14:51 /hdfs/dir/outbound/_part-10-0.in-progress
> -rw-r--r--   2 user hadoop    2447794 2016-03-03 14:51 /hdfs/dir/outbound/_part-11-0.in-progress
> -rw-r--r--   2 user hadoop    2453479 2016-03-03 14:51 /hdfs/dir/outbound/_part-2-0.in-progress
> -rw-r--r--   2 user hadoop    2512413 2016-03-03 14:51 /hdfs/dir/outbound/_part-3-0.in-progress
> -rw-r--r--   2 user hadoop    2431290 2016-03-03 14:51 /hdfs/dir/outbound/_part-4-0.in-progress
> -rw-r--r--   2 user hadoop    2457820 2016-03-03 14:51 /hdfs/dir/outbound/_part-5-0.in-progress
> -rw-r--r--   2 user hadoop    2430218 2016-03-03 14:51 /hdfs/dir/outbound/_part-6-0.in-progress
> -rw-r--r--   2 user hadoop    2482970 2016-03-03 14:51 /hdfs/dir/outbound/_part-7-0.in-progress
> -rw-r--r--   2 user hadoop    2452622 2016-03-03 14:51 /hdfs/dir/outbound/_part-8-0.in-progress
> -rw-r--r--   2 user hadoop    2467733 2016-03-03 14:51 /hdfs/dir/outbound/_part-9-0.in-progress
> 
> -shortly after task manager is killed:
> [user@host user]$ hdfs dfs -ls -R /hdfs/dir/outbound
> -rw-r--r--   2 user hadoop   12360198 2016-03-03 14:52 /hdfs/dir/outbound/_part-0-0.pending
> -rw-r--r--   2 user hadoop    9350654 2016-03-03 14:52 /hdfs/dir/outbound/_part-1-0.in-progress
> -rw-r--r--   2 user hadoop    9389872 2016-03-03 14:52 /hdfs/dir/outbound/_part-10-0.in-progress
> -rw-r--r--   2 user hadoop   12236015 2016-03-03 14:52 /hdfs/dir/outbound/_part-11-0.pending
> -rw-r--r--   2 user hadoop   12256483 2016-03-03 14:52 /hdfs/dir/outbound/_part-2-0.pending
> -rw-r--r--   2 user hadoop   12261730 2016-03-03 14:52 /hdfs/dir/outbound/_part-3-0.pending
> -rw-r--r--   2 user hadoop    9418913 2016-03-03 14:52 /hdfs/dir/outbound/_part-4-0.in-progress
> -rw-r--r--   2 user hadoop   12176987 2016-03-03 14:52 /hdfs/dir/outbound/_part-5-0.pending
> -rw-r--r--   2 user hadoop   12165782 2016-03-03 14:52 /hdfs/dir/outbound/_part-6-0.pending
> -rw-r--r--   2 user hadoop    9474037 2016-03-03 14:52 /hdfs/dir/outbound/_part-7-0.in-progress
> -rw-r--r--   2 user hadoop   12136347 2016-03-03 14:52 /hdfs/dir/outbound/_part-8-0.pending
> -rw-r--r--   2 user hadoop   12305943 2016-03-03 14:52 /hdfs/dir/outbound/_part-9-0.pending
> 
> -still a bit later:
> [user@host user]$ hdfs dfs -ls -R /hdfs/dir/outbound
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-0-0.valid-length
> -rw-r--r--   2 user hadoop    8026667 2016-03-03 14:52 /hdfs/dir/outbound/_part-0-1.pending
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-1-0.valid-length
> -rw-r--r--   2 user hadoop    8097752 2016-03-03 14:52 /hdfs/dir/outbound/_part-1-1.pending
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-10-0.valid-length
> -rw-r--r--   2 user hadoop    7109259 2016-03-03 14:52 /hdfs/dir/outbound/_part-10-1.pending
> -rw-r--r--   2 user hadoop          0 2016-03-03 14:52 /hdfs/dir/outbound/_part-2-0.valid-length
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-3-0.valid-length
> -rw-r--r--   2 user hadoop    7974655 2016-03-03 14:52 /hdfs/dir/outbound/_part-3-1.pending
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-4-0.valid-length
> -rw-r--r--   2 user hadoop    5753163 2016-03-03 14:52 /hdfs/dir/outbound/_part-4-1.pending
> -rw-r--r--   2 user hadoop          0 2016-03-03 14:52 /hdfs/dir/outbound/_part-5-0.valid-length
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-6-0.valid-length
> -rw-r--r--   2 user hadoop    7904468 2016-03-03 14:52 /hdfs/dir/outbound/_part-6-1.pending
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-7-0.valid-length
> -rw-r--r--   2 user hadoop    7477004 2016-03-03 14:52 /hdfs/dir/outbound/_part-7-1.pending
> -rw-r--r--   2 user hadoop          0 2016-03-03 14:52 /hdfs/dir/outbound/_part-8-0.valid-length
> -rw-r--r--   2 user hadoop          9 2016-03-03 14:52 /hdfs/dir/outbound/_part-9-0.valid-length
> -rw-r--r--   2 user hadoop    7979307 2016-03-03 14:52 /hdfs/dir/outbound/_part-9-1.pending
> -rw-r--r--   2 user hadoop   12360198 2016-03-03 14:52 /hdfs/dir/outbound/part-0-0
> -rw-r--r--   2 user hadoop    9350654 2016-03-03 14:52 /hdfs/dir/outbound/part-1-0
> -rw-r--r--   2 user hadoop    9389872 2016-03-03 14:52 /hdfs/dir/outbound/part-10-0
> -rw-r--r--   2 user hadoop   12256483 2016-03-03 14:52 /hdfs/dir/outbound/part-2-0
> -rw-r--r--   2 user hadoop   12261730 2016-03-03 14:52 /hdfs/dir/outbound/part-3-0
> -rw-r--r--   2 user hadoop    9418913 2016-03-03 14:52 /hdfs/dir/outbound/part-4-0
> -rw-r--r--   2 user hadoop   12176987 2016-03-03 14:52 /hdfs/dir/outbound/part-5-0
> -rw-r--r--   2 user hadoop   12165782 2016-03-03 14:52 /hdfs/dir/outbound/part-6-0
> -rw-r--r--   2 user hadoop    9474037 2016-03-03 14:52 /hdfs/dir/outbound/part-7-0
> -rw-r--r--   2 user hadoop   12136347 2016-03-03 14:52 /hdfs/dir/outbound/part-8-0
> -rw-r--r--   2 user hadoop   12305943 2016-03-03 14:52 /hdfs/dir/outbound/part-9-0
> 
> Can you see from this what is going wrong?
> 
> Cheers,
>  Max
> 
>> On 03.03.2016 at 14:50, Aljoscha Krettek <aljoscha@apache.org> wrote:
>> 
>> Hi,
>> did you check whether there are any files at your specified HDFS output location? If yes, which files are there?
>> 
>> Cheers,
>> Aljoscha
>>> On 03 Mar 2016, at 14:29, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
>>> 
>>> Just for the sake of completeness: this also happens when killing a task manager and is therefore probably unrelated to job manager HA.
>>> 
>>>> On 03.03.2016 at 14:17, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
>>>> 
>>>> Hi everyone,
>>>> 
>>>> unfortunately, I am running into another problem trying to establish exactly-once guarantees (Kafka -> Flink 1.0.0-rc3 -> HDFS).
>>>> 
>>>> When using
>>>> 
>>>> RollingSink<Tuple3<Integer,Integer,String>> sink = new RollingSink<Tuple3<Integer,Integer,String>>("hdfs://our.machine.com:8020/hdfs/dir/outbound");
>>>> sink.setBucketer(new NonRollingBucketer());
>>>> output.addSink(sink);
>>>> 
>>>> and then killing the job manager, the new job manager is unable to restore the old state, throwing
>>>> ---
>>>> java.lang.Exception: Could not restore checkpointed state to operators and functions
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:454)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:209)
>>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>>> 	at java.lang.Thread.run(Thread.java:744)
>>>> Caused by: java.lang.Exception: Failed to restore state to function: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>> 	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:168)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:446)
>>>> 	... 3 more
>>>> Caused by: java.lang.RuntimeException: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>> 	at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:686)
>>>> 	at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:122)
>>>> 	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:165)
>>>> 	... 4 more
>>>> ---
>>>> I found a resolved issue [1] concerning Hadoop 2.7.1. We are in fact using 2.4.0 – might this be the same issue?
>>>> 
>>>> Another thing I could think of is that the job is not configured correctly and there is some sort of timing issue. The checkpoint interval is 10 seconds; everything else was left at its default value. Then again, as the NonRollingBucketer is used, there should not be any timing issues, right?
>>>> 
>>>> Cheers,
>>>> Max
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-2979
>>>> 
>>>> 
>>> 
>> 
> 

