apex-users mailing list archives

From Ananth Gundabattula <agundabatt...@gmail.com>
Subject Re: Containers getting killed by application master
Date Wed, 18 May 2016 07:52:51 GMT
Hello Bhupesh,

The Kafka operator seems to be the one crashing. I am using the Kafka 0.9
operator from Malhar against a Kafka broker cluster running CDH Kafka 2.x.

Attaching the logs of this particular operator for reference.

Please note that there is an exception from the Netty driver, but I believe
it is not the root cause, as I have observed the same exception being thrown
by the Cassandra driver in other stacks as well.

However, the following lines in the log for the killed operator look
suspicious:

2016-05-18 07:17:17,556 INFO
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: Marking
the coordinator 2147483576 dead.
2016-05-18 07:17:17,556 WARN
org.apache.apex.malhar.kafka.AbstractKafkaInputOperator: Exceptions in
committing offsets
eventdetails_ingestion-8=OffsetAndMetadata{offset=1375350, metadata=''} :
org.apache.kafka.common.errors.NotCoordinatorForGroupException: This is not
the correct coordinator for this group.
2016-05-18 07:21:23,611 INFO
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator: Offset
commit for group ced_Consumer failed due to NOT_COORDINATOR_FOR_GROUP, will
find new coordinator and retry
2016-05-18 07:21:23,611 INFO
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: Marking
the coordinator 2147483577 dead.
2016-05-18 07:21:23,611 WARN
org.apache.apex.malhar.kafka.AbstractKafkaInputOperator: Exceptions in
committing offsets
eventdetails_ingestion-8=OffsetAndMetadata{offset=1377033, metadata=''} :
org.apache.kafka.common.errors.NotCoordinatorForGroupException: This is not
the correct coordinator for this group.
2016-05-18 07:21:23,612 INFO
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator: Offset
commit for group ced_Consumer failed due to NOT_COORDINATOR_FOR_GROUP, will
find new coordinator and retry
2016-05-18 07:21:23,612 WARN
org.apache.apex.malhar.kafka.AbstractKafkaInputOperator: Exceptions in
committing offsets
eventdetails_ingestion-8=OffsetAndMetadata{offset=1377073, metadata=''} :
org.apache.kafka.common.errors.NotCoordinatorForGroupException: This is not
the correct coordinator for this group.
2016-05-18 07:22:17,950 INFO
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator: Offset
commit for group ced_Consumer failed due to NOT_COORDINATOR_FOR_GROUP, will
find new coordinator and retry
2016-05-18 07:22:17,950 INFO
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: Marking
the coordinator 2147483576 dead.
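For what it's worth, NotCoordinatorForGroupException is one of the Kafka client's retriable errors: the group coordinator has moved and the consumer must rediscover it before the commit can succeed, so the commit is normally just retried. A minimal sketch of that retry pattern is below; note that RetriableCommitException is a stand-in class so the snippet compiles without the Kafka client jar (in the real client the exception is org.apache.kafka.common.errors.NotCoordinatorForGroupException), and commitWithRetry is a hypothetical helper, not part of the Malhar operator:

```java
import java.util.concurrent.Callable;

// Sketch: retrying an offset commit when the group coordinator moves.
public class OffsetCommitRetry {

    // Stand-in for Kafka's NotCoordinatorForGroupException, which is retriable.
    public static class RetriableCommitException extends RuntimeException {
        public RetriableCommitException(String msg) { super(msg); }
    }

    // Retries the commit action up to maxAttempts times; rethrows the last
    // retriable failure if the coordinator never stabilises. Assumes maxAttempts >= 1.
    public static <T> T commitWithRetry(Callable<T> commit, int maxAttempts) throws Exception {
        RetriableCommitException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return commit.call();
            } catch (RetriableCommitException e) {
                // Coordinator was marked dead; the consumer rediscovers it on the next call.
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulate a commit that fails twice with NOT_COORDINATOR_FOR_GROUP, then succeeds.
        String result = commitWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RetriableCommitException(
                        "This is not the correct coordinator for this group.");
            }
            return "committed";
        }, 5);
        System.out.println(result + " on attempt " + calls[0]);
    }
}
```

So the WARN lines above may be survivable on their own; the question is why the coordinator keeps getting marked dead in the first place.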



Regards,
Ananth

On Wed, May 18, 2016 at 5:29 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
wrote:

> Hi Ananth,
>
> Do the containers that are getting killed belong to any specific operator?
> Or are they getting killed randomly?
> I'd suggest having a look at the operator / container logs.
> You can also fetch them with: yarn logs -applicationId <App Id>
>
> ~Bhupesh
>
> On Wed, May 18, 2016 at 12:22 AM, Ananth Gundabattula <
> agundabattula@gmail.com> wrote:
>
>> Thanks all for the inputs.
>>
>> @Yogi: I do not have any operators that are dynamically partitioned; I
>> have not implemented definePartition() in any of my operators.
>>
>> @Bhupesh: I am not using the JSON parser operator from Malhar. I do use a
>> Jackson parser as an instance inside my operator that implements some
>> application-level logic. The stack trace seems to be coming from the Apex
>> pubsub codec handler.
>>
>> @Ashwin : The window ID seems to be moving forward.
>>
>> I would like to understand more about what we mean by container failure.
>> I am assuming that Apex automatically relaunches a container if it fails
>> for whatever reason. In fact, I do see operators getting killed (and on
>> clicking the details button, I see the message posted at the beginning of
>> this thread).
>>
>> One thing I want to note is that the operators are recreated automatically
>> when they fail, but after a couple of days even this recovery process
>> seems to break: new instances of the operators are no longer created after
>> they die, and the app runs with a lower operator count (and hence some
>> data is not getting processed).
>>
>> I observed this behavior on a non-HA cluster (CDH 5.7) as well, and hence
>> I do not suspect YARN HA is causing this. I am currently ruling out
>> network issues, as those would make all operators exhibit blips of some
>> sort. (Please correct me if I am wrong in this assumption.)
>>
>> Regards,
>> Ananth
>>
>>
>>
>> On Wed, May 18, 2016 at 4:53 PM, Yogi Devendra <yogidevendra@apache.org>
>> wrote:
>>
>>> There are some instances of "Heartbeat for unknown operator" in the log.
>>> So it looks like the operators are sending heartbeats, but STRAM is not
>>> able to identify them.
>>>
>>> In the past, I observed similar behavior when I was trying to define the
>>> dynamic partitioning for some operator.
>>>
>>>
>>> ~ Yogi
>>>
>>> On 18 May 2016 at 12:12, Ashwin Chandra Putta <ashwinchandrap@gmail.com>
>>> wrote:
>>>
>>>> Ananth,
>>>>
>>>> A heartbeat timeout means that the operator is not sending the
>>>> per-window heartbeat information back to the app master. It usually
>>>> happens for one of two reasons:
>>>>
>>>> 1. System failure - container died, network failure, etc.
>>>> 2. Windows not moving forward in the operator, i.e. some business logic
>>>> in the operator is blocking the windows. You can watch the window IDs on
>>>> the UI for the given operator while it is running to quickly find out if
>>>> this is the issue.
>>>>
>>>> Regards,
>>>> Ashwin.
>>>> On May 17, 2016 11:05 PM, "Ananth Gundabattula" <
>>>> agundabattula@gmail.com> wrote:
>>>>
>>>> Hello Sandeep,
>>>>
>>>> Thanks for the response. Please find attached the app master log.
>>>>
>>>> It looks like it got killed due to a heartbeat timeout. I will have to
>>>> see why I am getting one. I also see a JSON parser exception in the
>>>> attached log. Is it a harmless exception?
>>>>
>>>>
>>>> Regards,
>>>> Ananth
>>>>
>>>> On Wed, May 18, 2016 at 2:45 PM, Sandeep Deshmukh <
>>>> sandeep@datatorrent.com> wrote:
>>>>
>>>>> Dear Ananth,
>>>>>
>>>>> Could you please check the STRAM logs for any details about these
>>>>> containers? The first guess would be the container going out of memory.
>>>>>
>>>>> Regards,
>>>>> Sandeep
>>>>>
>>>>> On Wed, May 18, 2016 at 10:05 AM, Ananth Gundabattula <
>>>>> agundabattula@gmail.com> wrote:
>>>>>
>>>>>> Hello All,
>>>>>>
>>>>>> I was wondering what could cause a container to be killed by the
>>>>>> application master.
>>>>>>
>>>>>> I see the following in the UI when I click on details :
>>>>>>
>>>>>> "
>>>>>>
>>>>>> Container killed by the ApplicationMaster.
>>>>>> Container killed on request. Exit code is 143
>>>>>> Container exited with a non-zero exit code 143
>>>>>>
>>>>>> "
>>>>>>
>>>>>> I see some exceptions in the dtgateway.log and am not sure if they
>>>>>> are related.
>>>>>>
>>>>>> I am running Apex 3.3.0 on CDH 5.7 with HA enabled (HA for YARN as
>>>>>> well as for HDFS).
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
