Subject: Re: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondary NFS and DB)
To: dev@cloudstack.apache.org
References: <55F19C75.4020807@safeswisscloud.com> <55F28EA5.5080603@widodh.nl>
From: Wido den Hollander
Message-ID: <55F2C6FB.60505@widodh.nl>
Date: Fri, 11 Sep 2015 14:20:11 +0200
In-Reply-To: <55F28EA5.5080603@widodh.nl>

On 11-09-15 10:19, Wido den Hollander wrote:
>
> On 10-09-15 23:15, Andrija Panic wrote:
>> Wido,
>>
>> could you maybe follow up on what my colleague Dmytro just sent?
>>
>
> Yes, seems logical.
>
>> It's not only a matter of fixing rados-java (the 16-snapshot limit) - it
>> seems that for any RBD exception, ACS will freak out...
>>
>
> No, an RbdException will be caught, but the Rados bindings shouldn't
> throw a NegativeArraySizeException in any case.
>
> That's the main problem.
>

Seems to be fixed with this commit:
https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64

Just tested it with 256 snapshots:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running com.ceph.rbd.TestRbd
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 521.014 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

The bindings should now be capable of listing more than 16 snapshots.

You can build the bindings manually and replace rados.jar on your running systems.

For 4.6 I'll try to get the updated rados-java included.

Wido

> Wido
>
>> Thx
>>
>> On 10 September 2015 at 17:06, Dmytro Shevchenko <
>> dmytro.shevchenko@safeswisscloud.com> wrote:
>>
>>> Hello everyone, some clarification about this. Configuration:
>>> CS: 4.5.1
>>> Primary storage: Ceph
>>>
>>> We actually have two separate bugs:
>>>
>>> 1. When you remove a volume with more than 16 snapshots (it doesn't matter
>>> whether they are destroyed or active - they are always present on Ceph), the
>>> next storage garbage collector cycle invokes 'deletePhysicalDisk' from
>>> LibvirtStorageAdaptor.java. On line 854 we call the snapshot listing in the
>>> external rados-java library and get an exception.
>>>
>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
>>> This exception is not caught in the current function, but the Agent DOES NOT
>>> CRASH at this point and keeps working fine. The Agent forms a proper answer
>>> to the server and sends it; the text in the answer is the Java stack trace.
>>> Log from the Agent side:
>>>
>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Trying to fetch storage pool 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Attempting to remove volume 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>>> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught:
>>> java.lang.NegativeArraySizeException
>>>         at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>>>         at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId: 90520739779588, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat com.cloud.storage.re.....
>>>
>>> So this volume and its snapshots will never be removed.
>>>
>>> 2. Second bug. Experimentally we found that about 50 minutes after the
>>> exception occurred on the Agent, the Management server decided, for some
>>> unknown reason, that it had lost the connection to this agent, started the
>>> HA process and restarted the Agent process.
>>> Log on the Agent side:
>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
>>> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent started
>>> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>>> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
>>> .....
>>>
>>> Log on the Server side:
>>> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected with event AgentDisconnected
>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd , MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>>> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>>> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1: Channel is closed
>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the state of the host. Moving on.
>>>
>>> It looks like a connection problem, but it appears only when we have this
>>> RBD exception on the agent side, and only on this node. I tried playing with
>>> the "storage.cleanup.interval" parameter and set it to 5 minutes; now we get
>>> the exception on the agent side every 5 minutes, but the disconnects still
>>> happen every 50 minutes and I can't find out why.
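
For context, the code path Dmytro describes above - deletePhysicalDisk() listing,
unprotecting and removing the RBD snapshots of an image through rados-java before
deleting the image itself - boils down to roughly the following. This is a minimal
sketch rather than the actual CloudStack source: the wrapper class and method names
(RbdImageCleanup, deleteImageWithSnapshots) and their parameters are illustrative
only, while the Rados/Rbd/RbdImage calls are the rados-java API in question.
snapList() is the call that throws NegativeArraySizeException on an unpatched
rados-java once an image has more than 16 snapshots.

import java.util.List;

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class RbdImageCleanup {
    /**
     * Unprotect and remove all snapshots of an RBD image, then remove the
     * image itself - the same flow as deletePhysicalDisk() for RBD pools.
     */
    public static void deleteImageWithSnapshots(String monHost, String authUser,
                                                String authKey, String pool,
                                                String imageName) throws Exception {
        Rados rados = new Rados(authUser);
        rados.confSet("mon_host", monHost);
        rados.confSet("key", authKey);
        rados.connect();

        IoCTX ioctx = rados.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(ioctx);
            RbdImage image = rbd.open(imageName);
            try {
                // This listing fails with NegativeArraySizeException in
                // rados-java releases prior to the fix when the image has
                // more than 16 snapshots.
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
            }
            rbd.remove(imageName);
        } finally {
            rados.ioCtxDestroy(ioctx);
        }
    }
}
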
>>>
>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>>
>>>> Thx Wido,
>>>>
>>>> I will have my colleagues Igor and Dmytro join in with details on this.
>>>>
>>>> I agree we need a fix upstream, that is the main purpose from our side!
>>>>
>>>> With this temp fix we just avoid the agent crashing (the agent somehow
>>>> restarts fine again :) ), but VMs also go down on that host, at least some
>>>> of them.
>>>>
>>>> Do you see any lifecycle/workflow issue if we implement deleting the
>>>> snapshot from CEPH after you snapshot a volume in ACS and successfully move
>>>> it to Secondary NFS - or perhaps only delete the snapshot from CEPH as part
>>>> of the actual snapshot deletion (when you delete a snapshot, manually or via
>>>> scheduled snapshots, from the DB and NFS)? Maybe the second option is
>>>> better; I don't know how you guys handle this for regular NFS as primary
>>>> storage etc...
>>>>
>>>> Any guidance is most welcome, and our team will try to code all this.
>>>>
>>>> Thx Wido again
>>>>
>>>> On 10 September 2015 at 14:14, Wido den Hollander wrote:
>>>>
>>>> On 10-09-15 14:07, Andrija Panic wrote:
>>>> > Wido,
>>>> >
>>>> > the part of the code where you want to delete a volume checks if the
>>>> > volume is of type RBD - and then tries to list snapshots, delete
>>>> > snapshots, and finally remove the image. Here the first step - listing
>>>> > snapshots - fails if there are more than 16 snapshots present; the number
>>>> > 16 is hardcoded in another part of the code and an RBD exception is
>>>> > thrown... then the agent crashes... and then VMs go down etc.
>>>> >
>>>>
>>>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there
>>>> is a release of rados-java where this is fixed.
>>>>
>>>> Looking at the code of rados-java it should be, but I'm not 100% certain.
>>>>
>>>> > So our current quick fix is to invoke an external script which will
>>>> > also list and remove all snapshots, but will not fail.
>>>> >
>>>>
>>>> Yes, but we should fix it upstream. I understand that you will use a
>>>> temp script to clean up everything.
>>>>
>>>> > I'm not sure why 16 is the hardcoded limit - I will try to provide the
>>>> > part of the code where this is present... We can increase this number,
>>>> > but it doesn't make any sense (from 16 to e.g. 200), since we still have
>>>> > a lot of garbage left on CEPH (snapshots that were removed in ACS (DB and
>>>> > Secondary NFS) but not removed from CEPH). And in my understanding this
>>>> > needs to be implemented, so we don't hit the exceptions that I originally
>>>> > described...
>>>> >
>>>> > Any thoughts on this?
>>>> >
>>>>
>>>> A cleanup script for now should be OK indeed. Afterwards the Java code
>>>> should be able to do this.
>>>>
>>>> You can try manually by using rados-java and fix that.
>>>>
>>>> This is the part where the listing is done:
>>>>
>>>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>>
>>>> Wido
>>>>
>>>> > Thx for input!
>>>> >
>>>> > On 10 September 2015 at 13:56, Wido den Hollander wrote:
>>>> >
>>>> >>
>>>> >> On 10-09-15 12:17, Andrija Panic wrote:
>>>> >>> We are testing some [dirty?] patch on our dev system and we shall
>>>> >>> soon share it for review.
>>>> >>>
>>>> >>> Basically, we are using an external python script that is invoked in
>>>> >>> some part of the code execution to delete the needed CEPH snapshots
>>>> >>> and then proceed with the volume deletion etc...
>>>> >>>
>>>> >>
>>>> >> That shouldn't be required.
>>>> >> The Java bindings for librbd and librados should be able to remove
>>>> >> the snapshots.
>>>> >>
>>>> >> There is no need to invoke external code, this can all be handled in Java.
>>>> >>
>>>> >>> On 10 September 2015 at 11:26, Andrija Panic wrote:
>>>> >>>
>>>> >>>> Eh, OK. Thx for the info.
>>>> >>>>
>>>> >>>> BTW why is the 16-snapshot limit hardcoded - any reason for that?
>>>> >>>>
>>>> >>>> Not cleaning snapshots on CEPH and trying to delete a volume after
>>>> >>>> having more than 16 snapshots in CEPH = the Agent crashing on the KVM
>>>> >>>> side... and some VMs being rebooted etc - which means downtime :|
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>>
>>>> >>>> On 9 September 2015 at 22:05, Simon Weller wrote:
>>>> >>>>
>>>> >>>>> Andrija,
>>>> >>>>>
>>>> >>>>> The Ceph snapshot deletion is not currently implemented.
>>>> >>>>>
>>>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>> >>>>>
>>>> >>>>> - Si
>>>> >>>>>
>>>> >>>>> ________________________________________
>>>> >>>>> From: Andrija Panic
>>>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>> >>>>> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
>>>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from
>>>> >>>>> Secondary NFS and DB)
>>>> >>>>>
>>>> >>>>> Hi folks,
>>>> >>>>>
>>>> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are also
>>>> >>>>> affected) - when we delete a snapshot (volume snapshot) in ACS, ACS
>>>> >>>>> marks it as deleted in the DB and deletes it from NFS Secondary
>>>> >>>>> Storage, but fails to delete the snapshot on CEPH primary storage
>>>> >>>>> (it doesn't even try to delete it AFAIK).
>>>> >>>>>
>>>> >>>>> So we end up having 5 live snapshots in the DB (just an example), but
>>>> >>>>> actually in CEPH there are more than e.g. 16 snapshots.
>>>> >>>>>
>>>> >>>>> Worse, when the ACS agent tries to obtain the list of snapshots from
>>>> >>>>> CEPH for some volume - if the number of snapshots is over 16, it
>>>> >>>>> raises an exception (and perhaps this is the reason the Agent crashed
>>>> >>>>> for us - I need to check with my colleagues who are investigating this
>>>> >>>>> in detail). This number 16 is for whatever reason hardcoded in the ACS
>>>> >>>>> code.
>>>> >>>>>
>>>> >>>>> Wondering if anyone has experienced this, or has any info - we plan
>>>> >>>>> to try to fix this, and I will include my dev colleagues here, but we
>>>> >>>>> might need some help, at least for guidance.
>>>> >>>>>
>>>> >>>>> Any help is really appreciated, or at least confirmation that this is
>>>> >>>>> a known issue etc.
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>>
>>>> >>>>> Andrija Panić
>>>> >>>>
>>>> >>>> --
>>>> >>>>
>>>> >>>> Andrija Panić
>>>>
>>>> --
>>>>
>>>> Andrija Panić
>>>
>>> --
>>> ---
>>> Best regards
>>> Dmytro Shevchenko
>>> dshevchenko.mail@gmail.com
>>> skype: demonsh_mk
>>> +380(66)2426648
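
For the interim cleanup discussed above, the same rados-java calls can be used
outside the agent instead of an external python script. The sketch below is
illustrative only and not part of CloudStack: the class and method names
(OrphanSnapshotCleanup, removeOrphanSnapshots) and the idea of passing in a set
of snapshot names to keep are assumptions. A real cleanup would have to obtain
the snapshot names still referenced by CloudStack (e.g. from the cloud database)
before removing anything, and it needs a patched rados-java so snapList() works
with more than 16 snapshots.

import java.util.Set;

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class OrphanSnapshotCleanup {
    /**
     * Remove RBD snapshots of a single image that CloudStack no longer knows
     * about. The caller supplies a connected Rados handle and the snapshot
     * names that must be kept (those still referenced in the cloud database).
     */
    public static void removeOrphanSnapshots(Rados rados, String pool,
                                             String imageName,
                                             Set<String> keep) throws Exception {
        IoCTX ioctx = rados.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(ioctx);
            RbdImage image = rbd.open(imageName);
            try {
                for (RbdSnapInfo snap : image.snapList()) {
                    if (keep.contains(snap.name)) {
                        continue; // still referenced by CloudStack, leave it alone
                    }
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
            }
        } finally {
            rados.ioCtxDestroy(ioctx);
        }
    }
}
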