cloudstack-dev mailing list archives

From Dmytro Shevchenko <dmytro.shevche...@safeswisscloud.com>
Subject Re: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondary NFS and DB)
Date Thu, 17 Sep 2015 12:33:58 GMT
Nice work. I compiled and installed the new version into my local Maven
repository, but now I can't compile CloudStack with this library. I
changed the dependency version in the pom file to the new one, but got the
following exception while compiling 'cloud-plugin-hypervisor-kvm':

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile
(default-compile) on project cloud-plugin-hypervisor-kvm: Compilation
failure: Compilation failure:
[ERROR] Picked up JAVA_TOOL_OPTIONS:
-javaagent:/usr/share/java/jayatanaag.jar
[ERROR]
/home/dmytro.shevchenko/test/cloudstack/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/LibvirtComputingResource.java:[80,21]
error: cannot find symbol
[ERROR] symbol:   class RadosException
[ERROR] location: package com.ceph.rados

After investigating, I found that the class RadosException was moved to the
'exceptions' subpackage, but in LibvirtComputingResource.java it is still
imported as "import com.ceph.rados.RadosException;". My question is: if I
want to compile a release version with these new changes, which way is
preferred - changing the import path in LibvirtComputingResource.java and
the other places where this class is used to
"com.ceph.rados.exceptions.RadosException"?

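For anyone following along, a minimal sketch of the import change being asked
about. The surrounding class is only a placeholder, not the actual
LibvirtComputingResource code; the cephx user id is an assumption, and the
monitor address is taken from the logs quoted further down. Independently of
the import, the rados-java dependency version in the pom would still have to
be bumped to whatever number the new rados-java release finally gets.

    // Placeholder class only - it illustrates the moved exception class, nothing more.
    import com.ceph.rados.Rados;
    // old location: import com.ceph.rados.RadosException;
    import com.ceph.rados.exceptions.RadosException;   // new location after the package move

    public class RadosImportCheck {
        public static void main(String[] args) {
            try {
                Rados rados = new Rados("admin");             // cephx user id, placeholder
                rados.confSet("mon_host", "10.10.1.26:6789"); // monitor address from the logs below
                rados.connect();
                System.out.println("Connected to the Ceph cluster");
            } catch (RadosException e) {
                System.err.println("RADOS error: " + e.getMessage());
            }
        }
    }
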

-- 
Best regards
Dmytro Shevchenko
dmytro.shevchenko@safeswisscloud.com
skype: demonsh_mk



On 09/15/2015 03:11 PM, Wido den Hollander wrote:
>
> On 15-09-15 13:56, Dmytro Shevchenko wrote:
>> Hello Wido, I saw you updated this code again. Do you know what the
>> procedure is for rebuilding this library in the Apache Maven repository?
>> http://repo.maven.apache.org/maven2/com/ceph/rados/ still contains only
>> the old 0.1.4 version, so it's impossible to recompile CloudStack with
>> the new patches. Of course we can download the source code from GitHub,
>> compile it and replace the 'jar' file in production, but that is a dirty
>> hack and not acceptable for continuous integration.
>>
> It's up to me to do a new release of rados-java and I haven't done that
> yet since I wanted to know for sure if the code works.
>
> While writing some code for libvirt yesterday I came up with a better
> solution for rados-java as well.
>
> https://www.redhat.com/archives/libvir-list/2015-September/msg00458.html
>
> For now you can replace 'rados.jar' on the production systems, but for
> 4.6 I want to make sure we depend on a new, to be released, version of
> rados-java.
>
> Wido
>
>> ---
>> Best regards
>> Dmytro Shevchenko
>> dshevchenko.mail@gmail.com
>> skype: demonsh_mk
>>
>>
>> On 09/12/2015 06:16 PM, Wido den Hollander wrote:
>>> On 09/11/2015 05:08 PM, Andrija Panic wrote:
>>>> Thx a lot Wido!!! We will patch this. For my understanding - is this a
>>>> "temporary" solution, since it raises the limit to 256 snaps? Or am I
>>>> wrong? I mean, since we still don't have proper snapshot removal, after
>>>> e.g. 3-6 months we will again have 256 snapshots of a single volume on
>>>> CEPH?
>>>>
>>> No, it will also work with >256 snapshots. I've tested it with 256 and
>>> that worked fine. I see no reason why it won't work with 1024 or 2048
>>> for example.
>>>
>>>> BTW we also have another exception that causes the same consequences -
>>>> the agent disconnecting and VMs going down...
>>>> As Dmytro explained, unprotecting a snapshot causes the same consequence...
>>>>
>>>> From my understanding, any RBD exception might cause the Agent to
>>>> disconnect (or actually the mgmt server to disconnect the agent)...
>>>>
>>>> Any clue on this, recommendation?
>>>>
>>> No, I don't have a clue. It could be that the job hangs somewhere inside
>>> the Agent due to an uncaught exception though.
>>>
>>>> Thx a lot for fixing the rados-java stuff!
>>>>
>>> You're welcome!
>>>
>>> Wido
>>>
>>>> Andrija
>>>>
>>>> On 11 September 2015 at 15:28, Wido den Hollander <wido@widodh.nl> wrote:
>>>>
>>>>> On 11-09-15 14:43, Dmytro Shevchenko wrote:
>>>>>> Thanks a lot Wido! Any chance to find out why the management server
>>>>>> decided that it lost the connection to the agent after those exceptions?
>>>>>> It's not as critical as the 16-snapshot bug, but during the last week we
>>>>>> caught a situation where the Agent failed to unprotect a snapshot, raised
>>>>>> an exception, and that was the reason for a disconnection a bit later.
>>>>>> (It is not clear why CS decided to remove that volume; it was a template
>>>>>> with one 'gold' snapshot and several active clones.)
>>>>>>
>>>>> No, I didn't look at CS at all. I just spent the day improving the
>>>>> RADOS bindings.
>>>>>
>>>>> Wido
>>>>>
>>>>>> On 09/11/2015 03:20 PM, Wido den Hollander wrote:
>>>>>>> On 11-09-15 10:19, Wido den Hollander wrote:
>>>>>>>> On 10-09-15 23:15, Andrija Panic wrote:
>>>>>>>>> Wido,
>>>>>>>>>
>>>>>>>>> could you maybe follow what my colleague Dmytro just sent?
>>>>>>>>>
>>>>>>>> Yes, seems logical.
>>>>>>>>
>>>>>>>>> It's not only a matter of fixing rados-java (the 16-snapshot limit) -
>>>>>>>>> it seems that for any RBD exception, ACS will freak out...
>>>>>>>>>
>>>>>>>> No, a RbdException will be caught, but the Rados bindings shouldn't
>>>>>>>> throw NegativeArraySizeException in any case.
>>>>>>>>
>>>>>>>> That's the main problem.
>>>>>>>>
>>>>>>> Seems to be fixed with this commit:
>>>>>>>
>>>>>>> https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64
>>>>>>>
>>>>>>> Just tested it with 256 snapshots:
>>>>>>>
>>>>>>> -------------------------------------------------------
>>>>>>>     T E S T S
>>>>>>> -------------------------------------------------------
>>>>>>> Running com.ceph.rbd.TestRbd
>>>>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
>>>>>>> 521.014 sec
>>>>>>>
>>>>>>> Results :
>>>>>>>
>>>>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
>>>>>>>
>>>>>>> The bindings should now be capable of listing more than 16 snapshots.
>>>>>>>
>>>>>>> You can build the bindings manually and replace rados.jar on your
>>>>>>> running systems.
>>>>>>>
>>>>>>> For 4.6 I'll try to get the updated rados-java included.
>>>>>>>
>>>>>>> Wido
>>>>>>>
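The linked commit concerns listing more than 16 snapshots. The usual way a
binding handles this is a resize-and-retry loop, and the sketch below is a
self-contained illustration of that general pattern only - it is NOT the
actual rados-java code, which goes through JNA to librbd. fakeSnapList() is
a stand-in for librbd's rbd_snap_list(), which returns -ERANGE when the
caller's buffer is too small.

    import java.util.Arrays;

    // Illustration only, not the real binding code.
    public class SnapListRetryPattern {
        private static final int ERANGE = 34;

        // Stand-in for the native call: fills 'buf' when it is large enough,
        // otherwise signals that a bigger buffer is needed.
        private static int fakeSnapList(String[] buf, int snapsOnImage) {
            if (buf.length < snapsOnImage) {
                return -ERANGE;
            }
            for (int i = 0; i < snapsOnImage; i++) {
                buf[i] = "snap-" + i;
            }
            return snapsOnImage;
        }

        static String[] snapList(int snapsOnImage) {
            int capacity = 16;                          // the old fixed starting size
            while (true) {
                String[] buf = new String[capacity];
                int r = fakeSnapList(buf, snapsOnImage);
                if (r >= 0) {
                    return Arrays.copyOf(buf, r);
                }
                if (r == -ERANGE) {
                    capacity *= 2;                      // buffer too small: grow and retry
                } else {
                    throw new IllegalStateException("snap_list failed: " + r);
                }
            }
        }

        public static void main(String[] args) {
            // 256 snapshots, as in the test run above: no 16-entry ceiling.
            System.out.println(snapList(256).length);
        }
    }
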
>>>>>>>> Wido
>>>>>>>>
>>>>>>>>> Thx
>>>>>>>>>
>>>>>>>>> On 10 September 2015 at 17:06, Dmytro Shevchenko <
>>>>>>>>> dmytro.shevchenko@safeswisscloud.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello everyone, some clarification about this. Configuration:
>>>>>>>>>> CS: 4.5.1
>>>>>>>>>> Primary storage: Ceph
>>>>>>>>>>
>>>>>>>>>> Actually we have 2 separate bugs:
>>>>>>>>>>
>>>>>>>>>> 1. When you remove a volume with more than 16 snapshots (doesn't
>>>>>>>>>> matter whether they are destroyed or active - they are always present
>>>>>>>>>> on Ceph), the next storage garbage collector cycle invokes
>>>>>>>>>> 'deletePhysicalDisk' from LibvirtStorageAdaptor.java. On line 854 we
>>>>>>>>>> call the snapshot listing in the external rados-java library and get
>>>>>>>>>> an exception:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
>>>>>>>>>>
>>>>>>>>>> This exception is not caught in the current function, but the Agent
>>>>>>>>>> does NOT crash at this moment and continues working fine. The Agent
>>>>>>>>>> forms a proper answer to the server and sends it; the text of the
>>>>>>>>>> answer is the Java stack trace. Log from the Agent side:
>>>>>>>>>>
>>>>>>>>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>>>>>>>>>> (agentRequest-Handler-4:null) Trying to fetch storage pool
>>>>>>>>>> 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>>>>>>>>>> 2015-09-10 02:32:35,431 INFO  [kvm.storage.LibvirtStorageAdaptor]
>>>>>>>>>> (agentRequest-Handler-4:null) Attempting to remove volume
>>>>>>>>>> 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
>>>>>>>>>> 2015-09-10 02:32:35,431 INFO  [kvm.storage.LibvirtStorageAdaptor]
>>>>>>>>>> (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of
>>>>>>>>>> image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to
>>>>>>>>>> removing the image
>>>>>>>>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>>>>>>>>>> (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at
>>>>>>>>>> 10.10.1.26:6789
>>>>>>>>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>>>>>>>>>> (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image
>>>>>>>>>> cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>>>>>>>>>> 2015-09-10 02:32:35,457 WARN  [cloud.agent.Agent]
>>>>>>>>>> (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
>>>>>>>>>>            at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>>>>>>>>>>            at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>>>>>>>>>>            at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>>>>>>>>>>            at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>>>>>>>>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent]
>>>>>>>>>> (agentRequest-Handler-4:null) Seq 1-1743737480722513946:  { Ans: , MgmtId:
>>>>>>>>>> 90520739779588, via: 1, Ver: v1, Flags: 10,
>>>>>>>>>> [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat
>>>>>>>>>> com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat
>>>>>>>>>> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat
>>>>>>>>>> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat
>>>>>>>>>> com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat
>>>>>>>>>> com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat
>>>>>>>>>> com.cloud.storage.re.....
>>>>>>>>>>
>>>>>>>>>> so this volume and its snapshots will never be removed.
>>>>>>>>>>
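Regarding this first bug: below is a hedged sketch of how the snapshot
cleanup done in deletePhysicalDisk() could be guarded so that a runtime
exception coming out of the bindings (such as the NegativeArraySizeException
above) is reported as an ordinary failed delete instead of surfacing as a
bare stack trace in the Answer. The rados-java calls follow the ones already
used in LibvirtStorageAdaptor; exact package locations (e.g. of RbdSnapInfo)
may differ between rados-java versions, so treat this as an outline rather
than a drop-in patch.

    import java.util.List;

    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    public class RbdImageCleanup {

        // Unprotect and remove all snapshots of an RBD image, then remove the
        // image itself. Returns false instead of letting an exception escape.
        public static boolean deleteImageWithSnapshots(Rbd rbd, String imageName) {
            try {
                RbdImage image = rbd.open(imageName);
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
                rbd.close(image);
                rbd.remove(imageName);
                return true;
            } catch (RbdException e) {
                System.err.println("RBD error while removing " + imageName + ": " + e.getMessage());
                return false;
            } catch (RuntimeException e) {
                // Defensive: a bug in the bindings (like the one above) should not
                // leave the delete request without a clean failure result.
                System.err.println("Unexpected error from the RBD bindings while removing " + imageName);
                e.printStackTrace();
                return false;
            }
        }
    }
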
>>>>>>>>>>
>>>>>>>>>> 2. Second bug. Experimentally it has been found that 50 minutes after
>>>>>>>>>> we had the exception on the Agent, for some unknown reason the
>>>>>>>>>> Management server decided that it had lost the connection to this
>>>>>>>>>> agent, started the HA process and started the Agent process again.
>>>>>>>>>> Log on the Agent side:
>>>>>>>>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>>>>> (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>>>>>>>>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>>>>> (agentRequest-Handler-2:null) Execution is successful.
>>>>>>>>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>>>>> (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
>>>>>>>>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>>>>> (UgentTask-5:null) Execution is successful.
>>>>>>>>>> 2015-09-10 02:57:54,135 INFO  [cloud.agent.AgentShell] (main:null) Agent started
>>>>>>>>>> 2015-09-10 02:57:54,136 INFO  [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>>>>>>>>>> 2015-09-10 02:57:54,138 INFO  [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
>>>>>>>>>> .....
>>>>>>>>>>
>>>>>>>>>> Log on Server side:
>>>>>>>>>> 2015-09-10 02:57:53,710 INFO  [c.c.a.m.AgentManagerImpl]
>>>>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected
>>>>>>>>>> with event AgentDisconnected
>>>>>>>>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl]
>>>>>>>>>> (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>>>>>>>>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request]
>>>>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd ,
>>>>>>>>>> MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011,
>>>>>>>>>> [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>>>>>>>>>> 2015-09-10 02:57:53,724 INFO  [c.c.a.m.AgentAttache]
>>>>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send
>>>>>>>>>> due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>>>>>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache]
>>>>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>>>>>>>>>> 2015-09-10 02:57:53,724 WARN  [c.c.a.m.AgentManagerImpl]
>>>>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1:
>>>>>>>>>> Channel is closed
>>>>>>>>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl]
>>>>>>>>>> (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the
>>>>>>>>>> state of the host.  Moving on.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It looks like a connection problem, but it appears only when we have
>>>>>>>>>> this RBD exception on the agent side, and only on this node. I tried to
>>>>>>>>>> play with the "storage.cleanup.interval" parameter and set it to
>>>>>>>>>> 5 minutes; now we get the exception on the agent side every 5 minutes,
>>>>>>>>>> but the disconnects still happen every 50 minutes and I can't find out
>>>>>>>>>> why.
>>>>>>>>>>
>>>>>>>>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>>>>>>>>>
>>>>>>>>>>> Thx Wido,
>>>>>>>>>>>
>>>>>>>>>>> I will have my colleagues Igor and Dmytro join with details on this.
>>>>>>>>>>>
>>>>>>>>>>> I agree we need a fix upstream, that is the main purpose from our side!
>>>>>>>>>>> With this temp fix we just avoid the agent crashing (the agent somehow
>>>>>>>>>>> restarts again fine :) ), but VMs also go down on that host, at least
>>>>>>>>>>> some of them.
>>>>>>>>>>>
>>>>>>>>>>> Do you see any lifecycle/workflow issue if we implement deleting the
>>>>>>>>>>> SNAP from CEPH after you SNAP a volume in ACS and successfully move it
>>>>>>>>>>> to Secondary NFS - or perhaps only delete the SNAP from CEPH as part of
>>>>>>>>>>> the actual SNAP deletion (when you delete the snapshot from the DB and
>>>>>>>>>>> NFS, manually or via scheduled snapshots)? Maybe the second option is
>>>>>>>>>>> better; I don't know how you guys handle this for regular NFS as
>>>>>>>>>>> primary storage etc...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Any guidance is most welcome, and our team will try to code all this.
>>>>>>>>>>>
>>>>>>>>>>> Thx Wido again
>>>>>>>>>>>
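For the workflow question above, a minimal rados-java sketch of what deleting
a single volume snapshot on Ceph could look like once ACS decides a snapshot
is gone from the DB and secondary NFS. The connection flow is modeled on the
one in LibvirtStorageAdaptor; the pool, image and snapshot names below are
placeholders, and unprotecting only succeeds once the snapshot has no
remaining clones.

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    public class CephSnapshotDelete {
        public static void main(String[] args) throws Exception {
            Rados rados = new Rados("admin");                     // cephx user, placeholder
            rados.confSet("mon_host", "10.10.1.26:6789");         // monitor from the logs above
            rados.confSet("key", "<cephx secret>");               // placeholder secret
            rados.connect();

            IoCTX io = rados.ioCtxCreate("cloudstack-storage");   // primary storage pool, placeholder
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8"); // volume UUID from the logs

            String snapName = "acs-snapshot-1234";                // placeholder RBD snapshot name
            if (image.snapIsProtected(snapName)) {
                image.snapUnprotect(snapName);                    // fails while clones of this snapshot still exist
            }
            image.snapRemove(snapName);

            rbd.close(image);
            rados.ioCtxDestroy(io);
        }
    }
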
>>>>>>>>>>> On 10 September 2015 at 14:14, Wido den Hollander <wido@widodh.nl> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>        On 10-09-15 14:07, Andrija Panic wrote:
>>>>>>>>>>>        > Wido,
>>>>>>>>>>>        >
>>>>>>>>>>>        > part of the code where you want to delete some volume checks
>>>>>>>>>>>        > if the volume is of type RBD - and then tries to list
>>>>>>>>>>>        > snapshots, delete snapshots, and finally remove the image.
>>>>>>>>>>>        > Here the first step - listing snapshots - fails if there are
>>>>>>>>>>>        > more than 16 snapshots present - the number 16 is hardcoded
>>>>>>>>>>>        > elsewhere in the code and an RBD exception is thrown... then
>>>>>>>>>>>        > the agent crashes... and then VMs go down etc.
>>>>>>>>>>>        >
>>>>>>>>>>>
>>>>>>>>>>>        Hmmm, that seems like a bug in rados-java indeed. I don't know
>>>>>>>>>>>        if there is a release of rados-java where this is fixed.
>>>>>>>>>>>
>>>>>>>>>>>        Looking at the code of rados-java it should be, but I'm not 100%
>>>>>>>>>>>        certain.
>>>>>>>>>>>
>>>>>>>>>>>        > So our current quick fix is to invoke an external script which
>>>>>>>>>>>        > will also list and remove all snapshots, but will not fail.
>>>>>>>>>>>        >
>>>>>>>>>>>
>>>>>>>>>>>        Yes, but we should fix it upstream. I understand that you will
>>>>>>>>>>>        use a temp script to clean up everything.
>>>>>>>>>>>
>>>>>>>>>>>        > I'm not sure why 16 is the hardcoded limit - I will try to
>>>>>>>>>>>        > provide the part of the code where this is present... we can
>>>>>>>>>>>        > increase this number, but it doesn't make much sense (from 16
>>>>>>>>>>>        > to e.g. 200), since we still have a lot of garbage left on CEPH
>>>>>>>>>>>        > (snapshots that were removed in ACS (DB and Secondary NFS) but
>>>>>>>>>>>        > not removed from CEPH). In my understanding this needs to be
>>>>>>>>>>>        > implemented, so we don't catch any of the exceptions that I
>>>>>>>>>>>        > originally described...
>>>>>>>>>>>        >
>>>>>>>>>>>        > Any thoughts on this ?
>>>>>>>>>>>        >
>>>>>>>>>>>
>>>>>>>>>>>        A cleanup script for now should be OK indeed. Afterwards the
>>>>>>>>>>>        Java code should be able to do this.
>>>>>>>>>>>
>>>>>>>>>>>        You can try manually by using rados-java and fix that.
>>>>>>>>>>>
>>>>>>>>>>>        This is the part where the listing is done:
>>>>>>>>>>>
>>>>>>>>>>>        https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>>>>>>>>>
>>>>>>>>>>>        Wido
>>>>>>>>>>>
>>>>>>>>>>>        > Thx for input!
>>>>>>>>>>>        >
>>>>>>>>>>>        > On 10 September 2015 at 13:56, Wido den Hollander
>>>>>>>>>>>        > <wido@widodh.nl> wrote:
>>>>>>>>>>>        >
>>>>>>>>>>>        >>
>>>>>>>>>>>        >>
>>>>>>>>>>>        >> On 10-09-15 12:17, Andrija Panic wrote:
>>>>>>>>>>>        >>> We are testing some [dirty?] patch on our dev system and we
>>>>>>>>>>>        >>> shall soon share it for review.
>>>>>>>>>>>        >>>
>>>>>>>>>>>        >>> Basically, we are using an external python script that is
>>>>>>>>>>>        >>> invoked in some part of the code execution to delete the
>>>>>>>>>>>        >>> needed CEPH snapshots and then after that proceeds with the
>>>>>>>>>>>        >>> volume deletion etc...
>>>>>>>>>>>        >>>
>>>>>>>>>>>        >>
>>>>>>>>>>>        >> That shouldn't be required. The Java bindings for librbd and
>>>>>>>>>>>        >> librados should be able to remove the snapshots.
>>>>>>>>>>>        >>
>>>>>>>>>>>        >> There is no need to invoke external code; this can all be
>>>>>>>>>>>        >> handled in Java.
>>>>>>>>>>>        >>
>>>>>>>>>>>        >>> On 10 September 2015 at 11:26, Andrija Panic
>>>>>>>>>>>        >>> <andrija.panic@gmail.com> wrote:
>>>>>>>>>>>        >>>
>>>>>>>>>>>        >>>> Eh, OK. Thx for the info.
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>> BTW why is the 16-snapshot limit hardcoded - any reason for
>>>>>>>>>>>        >>>> that?
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>> Not cleaning snapshots on CEPH and trying to delete a volume
>>>>>>>>>>>        >>>> after having more than 16 snapshots in CEPH = Agent crashing
>>>>>>>>>>>        >>>> on the KVM side... and some VMs being rebooted etc - which
>>>>>>>>>>>        >>>> means downtime :|
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>> Thanks,
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>> On 9 September 2015 at 22:05, Simon Weller
>>>>>>>>>>>        >>>> <sweller@ena.com> wrote:
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>>> Andrija,
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> - Si
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> ________________________________________
>>>>>>>>>>>        >>>>> From: Andrija Panic <andrija.panic@gmail.com>
>>>>>>>>>>>        >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>>>>>>>        >>>>> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
>>>>>>>>>>>        >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH
>>>>>>>>>>>        >>>>> (only from Secondary NFS and DB)
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> Hi folks,
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> we encounter an issue in ACS 4.5.1 (perhaps other versions
>>>>>>>>>>>        >>>>> are also affected) - when we delete some snapshot (volume
>>>>>>>>>>>        >>>>> snapshot) in ACS, ACS marks it as deleted in the DB and
>>>>>>>>>>>        >>>>> deletes it from NFS Secondary Storage, but fails to delete
>>>>>>>>>>>        >>>>> the snapshot on CEPH primary storage (it doesn't even try
>>>>>>>>>>>        >>>>> to delete it AFAIK)
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> So we end up having 5 live snapshots in the DB (just an
>>>>>>>>>>>        >>>>> example) but actually in CEPH there are more than e.g. 16
>>>>>>>>>>>        >>>>> snapshots.
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> More of the issue: when the ACS agent tries to obtain the
>>>>>>>>>>>        >>>>> list of snapshots from CEPH for some volume, if the number
>>>>>>>>>>>        >>>>> of snapshots is over 16 it raises an exception (and perhaps
>>>>>>>>>>>        >>>>> this is the reason the Agent crashed for us - I need to
>>>>>>>>>>>        >>>>> check with my colleagues who are investigating this in
>>>>>>>>>>>        >>>>> detail). This number 16 is for whatever reason hardcoded in
>>>>>>>>>>>        >>>>> the ACS code.
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> Wondering if anyone has experienced this or has any info -
>>>>>>>>>>>        >>>>> we plan to try to fix this, and I will include my dev
>>>>>>>>>>>        >>>>> colleagues here, but we might need some help, at least for
>>>>>>>>>>>        >>>>> guidance or-
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> Any help is really appreciated, or at least confirmation
>>>>>>>>>>>        >>>>> that this is a known issue etc.
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> Thanks,
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> --
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>> Andrija Panić
>>>>>>>>>>>        >>>>>
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>> --
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>> Andrija Panić
>>>>>>>>>>>        >>>>
>>>>>>>>>>>        >>>
>>>>>>>>>>>        >>>
>>>>>>>>>>>        >>>
>>>>>>>>>>>        >>
>>>>>>>>>>>        >
>>>>>>>>>>>        >
>>>>>>>>>>>        >
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>>
>>>>>>>>>>> Andrija Panić
>>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> ---
>>>>>>>>>> Best regards
>>>>>>>>>> Dmytro Shevchenko
>>>>>>>>>> dshevchenko.mail@gmail.com
>>>>>>>>>> skype: demonsh_mk
>>>>>>>>>> +380(66)2426648
>>>>>>>>>>
>>>>>>>>>>

