flink-user mailing list archives

From Vishal Santoshi <vishal.santo...@gmail.com>
Subject Re: Why did JM fail on K8s (see original thread below)
Date Sat, 29 Jun 2019 14:03:16 GMT
We are investigating that. But is the above theory plausible (Flink gurus),
even though this, i.e. forcing restartPolicy: Never, pretty much nullifies HA
on the JM if it is a Job cluster (at least on K8s)?


As for the root cause, we are still investigating.

One thing we are looking at is the QoS (
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/ ),
which was Burstable, since a limit was not set on the job. The below, for
example, would have a QoS of Guaranteed. So one reason could be this.

resources:
  requests:
    cpu: "8.0"
    memory: "16Gi"
  limits:
    cpu: "8.0"
    memory: "16Gi"
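
For K8s to assign the Guaranteed class, every container in the pod needs
requests equal to limits for both cpu and memory. A minimal sketch of where
that block sits (pod/container names and image are placeholders, not our
actual manifest):

apiVersion: v1
kind: Pod
metadata:
  name: jobmanager              # placeholder
spec:
  containers:
    - name: jobmanager          # placeholder
      image: flink:1.8          # placeholder
      resources:
        requests:
          cpu: "8.0"
          memory: "16Gi"
        limits:                 # requests == limits on every container => Guaranteed
          cpu: "8.0"
          memory: "16Gi"

You can confirm the class K8s assigned with
kubectl get pod <pod> -o jsonpath='{.status.qosClass}'.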
On Sat, Jun 29, 2019 at 9:35 AM Timothy Victor <victtim@gmail.com> wrote:

> This is slightly off topic, so I'm changing the subject so as not to conflate
> the original issue you brought up. But do we know why the JM crashed in the
> first place?
>
> We are also thinking of moving to K8s, but to be honest we had tons of
> stability issues in our first rodeo.  That could just be our lack of a
> deeper understanding of K8s (a lot had to do with DNS).  But cases like
> this make me wonder what happened to cause the crash in the first place.
>
> Thanks
>
> Tim
>
>
> On Sat, Jun 29, 2019, 8:25 AM Vishal Santoshi <vishal.santoshi@gmail.com>
> wrote:
>
>> Another point: the JM had terminated. The policy on K8s for a Job Cluster is
>>
>> spec:
>>   restartPolicy: OnFailure
>>
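>> (For context, that restartPolicy sits at the pod-template level of the K8s
>> Job manifest; a minimal sketch, with placeholder names and image:
>>
>> apiVersion: batch/v1
>> kind: Job
>> metadata:
>>   name: flink-job-cluster       # placeholder
>> spec:
>>   template:
>>     spec:
>>       restartPolicy: OnFailure  # Never would not restart a failed container in its pod
>>       containers:
>>         - name: jobmanager      # placeholder
>>           image: flink:1.8      # placeholder
>>
>> With Never, a failed container is not restarted in place, which is the
>> behavior discussed above.)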
>>
>> 2019-06-29 00:33:14,308 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1443.
>>
>>
>>
>> On Sat, Jun 29, 2019 at 9:04 AM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> I have not tried on bare metal. We have no option but k8s.
>>>
>>> And this is a job cluster.
>>>
>>> On Sat, Jun 29, 2019 at 9:01 AM Timothy Victor <victtim@gmail.com>
>>> wrote:
>>>
>>>> Hi Vishal, can this be reproduced on a bare metal instance as well?
>>>>  Also is this a job or a session cluster?
>>>>
>>>> Thanks
>>>>
>>>> Tim
>>>>
>>>> On Sat, Jun 29, 2019, 7:50 AM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> OK, this happened again, and it is bizarre (and is definitely not what I
>>>>> think should happen).
>>>>>
>>>>> The job failed and I see these logs (in essence it is keeping the last 5
>>>>> externalized checkpoints) but deleting the ZK checkpoints directory:
>>>>>
>>>>> 2019-06-29 00:33:13,736 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5654 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5654' not discarded.
>>>>> 2019-06-29 00:33:13,786 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5655 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5655' not discarded.
>>>>> 2019-06-29 00:33:13,836 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5656 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5656' not discarded.
>>>>> 2019-06-29 00:33:13,886 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5657 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5657' not discarded.
>>>>> 2019-06-29 00:33:13,936 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5658 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5658' not discarded.
>>>>> 2019-06-29 00:33:13,936 INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing /kafka-to-hdfs-v2/kafka-to-hdfs-v2/k8s/checkpoints/00000000000000000000000000000005 from ZooKeeper
>>>>>
>>>>> The job restarts, and this is bizarre. It does not find the ZK
>>>>> checkpoint directory, but instead of going to state.checkpoints.dir to
>>>>> get its last checkpoint, it restarts from the savepoint that we started
>>>>> this job with (resetting the checkpoint id) about 15 days ago.
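>>>>>
>>>>> (For reference, the settings in play look roughly like this in our
>>>>> flink-conf.yaml; quorum hosts and paths below are placeholders, not our
>>>>> actual values:
>>>>>
>>>>> high-availability: zookeeper
>>>>> high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
>>>>> high-availability.storageDir: hdfs:///flink/ha
>>>>> high-availability.cluster-id: /kafka-to-hdfs-v2
>>>>> state.checkpoints.dir: hdfs:///flink-checkpoints
>>>>> state.checkpoints.num-retained: 5
>>>>>
>>>>> The expectation was that, with the ZK node gone, recovery would still
>>>>> find the externalized checkpoints under state.checkpoints.dir.)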
>>>>>
>>>>> 2019-06-29 00:33:20,045 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Recovering checkpoints from ZooKeeper.
>>>>> 2019-06-29 00:33:20,051 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to fetch 0 checkpoints from storage.
>>>>> 2019-06-29 00:33:20,051 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 0 checkpoints in ZooKeeper.
>>>>> 2019-06-29 00:33:20,053 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Starting job 00000000000000000000000000000005 from savepoint hdfs://nn-crunchy:8020/flink-savepoints_k8s/prod/kafka-to-hdfs/savepoint-000000-128f419cdc6f ()
>>>>> 2019-06-29 00:33:20,538 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Reset the checkpoint ID of job 00000000000000000000000000000005 to 4203.
>>>>> 2019-06-29 00:33:20,538 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Recovering checkpoints from ZooKeeper.
>>>>> 2019-06-29 00:33:20,549 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to retrieve checkpoint 4202.
>>>>> 2019-06-29 00:33:20,548 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 1 checkpoints in ZooKeeper.
>>>>> 2019-06-29 00:33:20,549 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to fetch 1 checkpoints from storage.
>>>>>
>>>>>
>>>>>
>>>>> This just does not make sense....
>>>>>
>>>>> On Wed, Jun 5, 2019 at 9:29 AM Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>
>>>>>> Ok, I will do that.
>>>>>>
>>>>>> On Wed, Jun 5, 2019, 8:25 AM Chesnay Schepler <chesnay@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you provide us the jobmanager logs?
>>>>>>>
>>>>>>> After the first restart the JM should have started deleting older
>>>>>>> checkpoints as new ones were created.
>>>>>>> After the second restart the JM should have recovered all 10
>>>>>>> checkpoints, started from the latest, and started pruning old ones as
>>>>>>> new ones were created.
>>>>>>>
>>>>>>> So you're running into 2 separate issues here, which is a bit odd.
>>>>>>>
>>>>>>> On 05/06/2019 13:44, Vishal Santoshi wrote:
>>>>>>>
>>>>>>> Any one?
>>>>>>>
>>>>>>> On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi <
>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>
>>>>>>>> The above is flink 1.8
>>>>>>>>
>>>>>>>> On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi <
>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I had a sequence of events that created this issue.
>>>>>>>>>
>>>>>>>>> * I started a job with state.checkpoints.num-retained: 5.
>>>>>>>>>
>>>>>>>>> * As expected, I have the 5 latest checkpoints retained in my HDFS
>>>>>>>>> backend.
>>>>>>>>>
>>>>>>>>> * The JM dies (K8s limit etc.) without cleaning the HDFS directory.
>>>>>>>>> The K8s job restores from the latest checkpoint (I think), but as it
>>>>>>>>> creates new checkpoints it does not delete the older checkpoints. At
>>>>>>>>> the end there are now 10 checkpoints: 5 from the old run, which
>>>>>>>>> remain static, and 5 latest representing the ongoing pipe.
>>>>>>>>>
>>>>>>>>> * The JM dies again and restarts from the latest of the 5 old
>>>>>>>>> checkpoints.
>>>>>>>>>
>>>>>>>>> This looks like a bug in the Job Cluster implementation of Flink. It
>>>>>>>>> looks like it is taking the 5th checkpoint from the beginning based
>>>>>>>>> on the num-retained value. Note that it has the same job id and does
>>>>>>>>> not scope to a new directory.
>>>>>>>>>
>>>>>>>>> https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109
>>>>>>>>>
>>>>>>>>> Please tell me if this does not make sense.
>>>>>>>>>
>>>>>>>>> Vishal
