cloudstack-dev mailing list archives

From Wido den Hollander <w...@widodh.nl>
Subject Re: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondary NFS and DB)
Date Thu, 10 Sep 2015 14:45:13 GMT


On 10-09-15 14:21, Andrija Panic wrote:
> Thx Wido,
> 
> I will have my colleagues Igor and Dmytro join with details on this.
> 

Great!

> I agree we need a fix upstream; that is the main goal from our side!
> 

I'd love to see a pull request for rados-java :) 0.1.5 should be released
then.

> With this temp fix, we just avoid the agent crashing (the agent somehow
> restarts fine again :) ), but VMs also go down on that host, at least some
> of them.
> 

True, but I think the fix in rados-java won't be that hard.
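
Roughly, snapList() needs to retry on -ERANGE instead of treating the
initial buffer of 16 entries as a limit: librbd's rbd_snap_list() returns
-ERANGE and writes the required count back into max_snaps when the buffer
is too small. A minimal sketch of that loop (not the actual rados-java
source; 'rbd', 'RbdException' and SNAP_INFO_SIZE stand in for the JNA
plumbing in com.ceph.rbd.jna):

import com.sun.jna.Memory;
import com.sun.jna.Pointer;
import com.sun.jna.ptr.IntByReference;

// Lists snapshots, growing the buffer until librbd is satisfied.
private Pointer listSnapsRetrying(Pointer imageHandle, IntByReference numSnaps)
        throws RbdException {
    numSnaps.setValue(16);                        // initial guess, not a limit
    while (true) {
        Memory buf = new Memory((long) numSnaps.getValue() * SNAP_INFO_SIZE);
        int r = rbd.rbd_snap_list(imageHandle, buf, numSnaps);
        if (r >= 0) {
            numSnaps.setValue(r);                 // snapshots actually returned
            return buf;                           // caller unpacks the structs
        }
        if (r != -34) {                           // -34 == -ERANGE on Linux
            throw new RbdException("Failed to list snapshots", r);
        }
        // -ERANGE: librbd wrote the required count into numSnaps; retry
    }
}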

> Do you see any lifecycle/workflow issue if we implement deleting the
> snapshot from CEPH right after ACS snapshots a volume and successfully
> moves it to Secondary NFS - or perhaps only delete the snapshot from CEPH
> as part of the actual snapshot deletion (when you delete the snapshot from
> the DB and NFS, manually or via scheduled snapshots)? Maybe the second
> option is better; I don't know how you guys handle this for regular NFS as
> primary storage etc...
> 

No, there is no problem. You can remove the RBD snapshot afterwards; ACS
will never touch it.

So it's fine to remove any RBD snapshot(s) from volumes without telling ACS.
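
For example, a cleanup for a single volume with rados-java could look
roughly like this (monitor address, key, pool and volume UUID are
placeholders for your setup; snapIsProtected() assumes a recent rados-java,
with older versions you would try/catch the unprotect instead):

import java.util.List;

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class CephSnapCleanup {
    public static void main(String[] args) throws Exception {
        Rados rados = new Rados("admin");              // cephx user
        rados.confSet("mon_host", "mon1.example.com"); // placeholder
        rados.confSet("key", "AQBsomesecretkey==");    // placeholder
        rados.connect();

        IoCTX io = rados.ioCtxCreate("cloudstack");    // primary storage pool
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("volume-uuid");  // the ACS volume
            try {
                // Remove every RBD snapshot of this volume. Snapshots that
                // were cloned from are protected and must be unprotected
                // before removal.
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
            }
        } finally {
            rados.ioCtxDestroy(io);
        }
    }
}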

Wido

> 
> Any guidance is most welcomed, and our team will try to code all this.
> 
> Thx Wido again
> 
> On 10 September 2015 at 14:14, Wido den Hollander <wido@widodh.nl> wrote:
> 
>>
>>
>> On 10-09-15 14:07, Andrija Panic wrote:
>>> Wido,
>>>
>>> The part of the code where you delete a volume checks whether the volume
>>> is of type RBD, and then tries to list the snapshots, delete the
>>> snapshots, and finally remove the image. The first step - listing the
>>> snapshots - fails if there are more than 16 snapshots present; the number
>>> 16 is hardcoded elsewhere in the code and an RBD exception is thrown...
>>> then the agent crashes... and then VMs go down etc.
>>>
>>
>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there
>> is a release of rados-java where this is fixed.
>>
>> Looking at the code of rados-java it should be, but I'm not 100% certain.
>>
>>> So our current quick fix is to invoke an external script which will also
>>> list and remove all snapshots, but will not fail.
>>>
>>
>> Yes, but we should fix it upstream. I understand that you will use a
>> temp script to clean up everything.
>>
>>> I'm not sure why 16 is the hardcoded limit - I will try to provide the
>>> part of the code where this is present... We can increase this number
>>> (from 16 to e.g. 200), but it doesn't make any sense, since we still have
>>> a lot of garbage left on CEPH (snapshots that were removed in ACS (DB and
>>> Secondary NFS) but not removed from CEPH). In my understanding this needs
>>> to be implemented, so we don't catch any of the exceptions that I
>>> originally described...
>>>
>>> Any thoughts on this ?
>>>
>>
>> A cleanup script for now should be OK indeed. Afterwards the Java code
>> should be able to do this.
>>
>> You can try it manually using rados-java and fix it there.
>>
>> This is the part where the listing is done:
>>
>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>
>> Wido
>>
>>> Thx for input!
>>>
>>> On 10 September 2015 at 13:56, Wido den Hollander <wido@widodh.nl> wrote:
>>>
>>>>
>>>>
>>>> On 10-09-15 12:17, Andrija Panic wrote:
>>>>> We are testing some [dirty?] patch on our dev system and we shall soon
>>>>> share it for review.
>>>>>
>>>>> Basically, we are using an external python script that is invoked in
>>>>> some part of the code execution to delete the needed CEPH snapshots,
>>>>> and after that it proceeds with the volume deletion etc...
>>>>>
>>>>
>>>> That shouldn't be required. The Java bindings for librbd and librados
>>>> should be able to remove the snapshots.
>>>>
>>>> There is no need to invoke external code; this can all be handled in
>>>> Java.
>>>>
>>>>> On 10 September 2015 at 11:26, Andrija Panic <andrija.panic@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Eh, OK. Thx for the info.
>>>>>>
>>>>>> BTW why is the 16-snapshot limit hardcoded - any reason for that?
>>>>>>
>>>>>> Not cleaning up snapshots on CEPH and trying to delete a volume after
>>>>>> having more than 16 snapshots in CEPH = the agent crashing on the KVM
>>>>>> side... and some VMs being rebooted etc - which means downtime :|
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On 9 September 2015 at 22:05, Simon Weller <sweller@ena.com> wrote:
>>>>>>
>>>>>>> Andrija,
>>>>>>>
>>>>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>>>
>>>>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>>>
>>>>>>> - Si
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Andrija Panic <andrija.panic@gmail.com>
>>>>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>>> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
>>>>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from
>>>>>>> Secondary NFS and DB)
>>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are also
>>>>>>> affected) - when we delete a snapshot (volume snapshot) in ACS, ACS
>>>>>>> marks it as deleted in the DB and deletes it from NFS Secondary
>>>>>>> Storage, but fails to delete the snapshot on CEPH primary storage (it
>>>>>>> doesn't even try to delete it AFAIK).
>>>>>>>
>>>>>>> So we end up having, say, 5 live snapshots in the DB, while in CEPH
>>>>>>> there are actually more than 16 snapshots.
>>>>>>>
>>>>>>> On top of that, when the ACS agent tries to obtain the list of
>>>>>>> snapshots from CEPH for a volume - if the number of snapshots is over
>>>>>>> 16, an exception is raised (and perhaps this is the reason the agent
>>>>>>> crashed for us - I need to check with my colleagues who are
>>>>>>> investigating this in detail). This number 16 is for whatever reason
>>>>>>> hardcoded in the ACS code.
>>>>>>>
>>>>>>> Wondering if anyone has experienced this or has any info - we plan to
>>>>>>> try to fix this, and I will include my dev colleagues here, but we
>>>>>>> might need some help, at least for guidance.
>>>>>>>
>>>>>>> Any help is really appreciated, or at least confirmation that this is
>>>>>>> a known issue etc.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Andrija Panić
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Andrija Panić
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 
