cloudstack-dev mailing list archives

From Simon Weller <swel...@ena.com>
Subject Re: Agent dies every night/morning.... memory violation
Date Mon, 23 Feb 2015 16:30:59 GMT
I agree with Marcus. I suggest you start monitoring everything that's going on around
that time frame.
Maybe dump available memory and I/O (both disk and network) to a file every minute or so,
and see if you can correlate it with something in particular that might be happening on the
underlying server, or with the network connectivity to that server. Maybe slowly move VMs one
at a time to a different host and see if the issue follows a particular VM.
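
For example, a minimal snapshot script you could run from cron every minute (a sketch; the log path is an assumption, and any persistent location works):

```shell
#!/bin/sh
# Append a timestamped snapshot of memory, disk, and network counters
# to a log file. Run from cron every minute, e.g.:
#   * * * * * /usr/local/bin/host-snapshot.sh
LOG=/tmp/host-snapshot.log   # assumed path; point at persistent storage
{
  date '+%Y-%m-%d %H:%M:%S'
  grep -E 'MemFree|MemAvailable|SwapFree' /proc/meminfo  # memory headroom
  cat /proc/diskstats                                    # per-device disk I/O counters
  cat /proc/net/dev                                      # per-interface network counters
  echo '---'
} >> "$LOG"
```

Afterwards you can line the timestamps up against the 5.20am-5.40am crash window.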

In the meantime, to reduce the impact of this problem, you could use a process monitor
like Monit to watch the PID and restart cloudstack-agent if a failure is detected.
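
A minimal Monit stanza for that might look like the following (a sketch; the pidfile and init-script paths are assumptions and vary by distro and packaging):

```
# Restart cloudstack-agent if its process disappears.
# Paths below are assumed -- check your distro's packaging.
check process cloudstack-agent with pidfile /var/run/cloudstack-agent.pid
  start program = "/etc/init.d/cloudstack-agent start"
  stop program  = "/etc/init.d/cloudstack-agent stop"
  if 3 restarts within 5 cycles then alert
```

With a `check process` entry like this, Monit restarts the service whenever the PID goes away, and alerts you if it keeps flapping.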

- Si

________________________________________
From: Marcus <shadowsor@gmail.com>
Sent: Monday, February 23, 2015 10:21 AM
To: dev@cloudstack.apache.org
Cc: users@cloudstack.apache.org
Subject: Re: Agent dies every night/morning.... memory violation

It doesn't really sound like an agent problem, but rather some other root
problem that is causing issues for the agent. Perhaps it is specific
to the host simply because there is a particular VM that always runs
on that host, and the VM itself is triggering the issue. Perhaps a
heavy logrotate or cron job on the VM causes issues for librados. Just
grasping at straws here. From the output provided it does seem that
the libvirt bindings that include Ceph code are terminating the agent
execution. My guess is that if you focus on "why this host" as
opposed to "what's going on", you'll find the answer to both. Sorry, I
know that's not much help.
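
For the "which VM" angle, one cheap approach is to log the running guest list every minute and line it up with the crash timestamps afterwards (a sketch; the log path is an assumption):

```shell
#!/bin/sh
# Record which libvirt domains are running, once per minute from cron:
#   * * * * * /usr/local/bin/vm-inventory.sh
LOG=/tmp/vm-inventory.log   # assumed path
{
  date '+%Y-%m-%d %H:%M:%S'
  if command -v virsh >/dev/null 2>&1; then
    virsh list --name       # names of running domains
  else
    echo 'virsh not found'  # libvirt client not installed
  fi
  echo '---'
} >> "$LOG"
```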

On Mon, Feb 23, 2015 at 7:29 AM, Andrija Panic <andrija.panic@gmail.com> wrote:
> Anybody? Before I start to cry :(
>
> On 21 February 2015 at 21:18, Andrija Panic <andrija.panic@gmail.com> wrote:
>
>> HI Simon,
>>
>> selinux is disabled, I have just double checked.
>>
>> BTW, this is what I can see in the cloudstack-agent.err log - it seems like
>> some Ceph-related issue, but I'm not sure why the agent would die...
>> If I recall correctly, this might be happening since the Ceph update from
>> 0.80.3? to 0.87 - and this seems like some crash in librados....
>>
>>
>> libust[1907/2046]: Warning: HOME environment variable not set. Disabling
>> LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt: Storage Driver error : failed to remove volume
>> 'cloudstack/bd751250-de35-4d2e-a4e3-3ee4b636c2a7': Device or resource busy
>> ./log/SubsystemMap.h: In function 'bool
>> ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread
>> 7f04427fc700 time 2015-02-21 06:39:38.839210
>> ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>>  ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
>>  1: (()+0x1fe223) [0x7f060c932223]
>>  2: (ObjectCacher::flusher_entry()+0x155) [0x7f060c9866e5]
>>  3: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f060c9976cd]
>>  4: (()+0x79d1) [0x7f06605ee9d1]
>>  5: (clone()+0x6d) [0x7f066033bb5d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>> 21/02/2015 06:39:38 1905 jsvc.exec error: Service did not exit cleanly
>>
>> On 20 February 2015 at 21:56, Simon Weller <sweller@ena.com> wrote:
>>
>>> Andrija,
>>>
>>> What is SELinux set to on this host?
>>>
>>>
>>> - SI
>>>
>>>
>>> ________________________________________
>>> From: Andrija Panic <andrija.panic@gmail.com>
>>> Sent: Friday, February 20, 2015 6:06 AM
>>> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
>>> Subject: Agent dies every night/morning.... memory violation
>>>
>>> Hi,
>>>
>>> I have crazy agent on one of the hosts, that is being killed each morning
>>> and I found this in /var/log/audit.log:
>>>
>>> type=ANOM_ABEND msg=audit(1424321463.930:430678): auid=0 uid=0 gid=0
>>> ses=68891 pid=10831 comm="jsvc" reason="memory violation" sig=6
>>>
>>> I don't remember changing anything on the system, but this keeps happening
>>> each morning around the same time, 5.20am-5.40am.
>>>
>>> I'm wondering what the heck is happening; any suggestions on where to
>>> troubleshoot?
>>> Will check the logs in detail anyway...
>>>
>>> --
>>>
>>> Andrija Panić
>>>
>>
>>
>>
>> --
>>
>> Andrija Panić
>>
>
>
>
> --
>
> Andrija Panić