Return-Path: X-Original-To: apmail-incubator-cloudstack-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-cloudstack-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A5453EC08 for ; Mon, 28 Jan 2013 20:39:14 +0000 (UTC) Received: (qmail 54206 invoked by uid 500); 28 Jan 2013 20:39:14 -0000 Delivered-To: apmail-incubator-cloudstack-dev-archive@incubator.apache.org Received: (qmail 54173 invoked by uid 500); 28 Jan 2013 20:39:14 -0000 Mailing-List: contact cloudstack-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: cloudstack-dev@incubator.apache.org Delivered-To: mailing list cloudstack-dev@incubator.apache.org Received: (qmail 54165 invoked by uid 99); 28 Jan 2013 20:39:13 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Jan 2013 20:39:13 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of shadowsor@gmail.com designates 209.85.220.176 as permitted sender) Received: from [209.85.220.176] (HELO mail-vc0-f176.google.com) (209.85.220.176) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Jan 2013 20:39:09 +0000 Received: by mail-vc0-f176.google.com with SMTP id fy27so2136887vcb.35 for ; Mon, 28 Jan 2013 12:38:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:content-type; bh=wEKcdhu9Dt1Jlv9kiQWgoZPfFLpPgOh1XpmEdjdaVJk=; b=F9X/9aSdxZDogbALLH8TuE1G3Z8YC/V5HajWtIwWeXm0I+jUhLDokyJfKbwvwp4zj9 AJjRUdQ+B/DLTIkGo1LhTV/HfFYSiRb5l8frjkx0QbLp8927AKkT5ATwOzrPWZIKs04Z Fv6e051zMFW7WF2Enlo9QJgNWzgP+VtXSp+Z9NFGr10MxJR0x5S0RQ5IVlxnkaK22K/U XW4yzP7CItdxRNX8dTJ6gbU/0Mo2D7Jrl7JQVUnlBsPJ1TvTNc0pmd4QxOVq+QooEGG4 6O2rHynNO/Bap77+6x3SROAPxLCgeeKs26TZMhgrz5uIapa5mTfxIsE8Vt1r7gSWyX8v uDQw== MIME-Version: 1.0 X-Received: by 10.52.34.75 with SMTP id x11mr9536538vdi.117.1359405528615; Mon, 28 Jan 2013 12:38:48 -0800 (PST) Received: by 10.58.145.135 with HTTP; Mon, 28 Jan 2013 12:38:48 -0800 (PST) In-Reply-To: References: <51031D54.4090706@betterservers.com> Date: Mon, 28 Jan 2013 13:38:48 -0700 Message-ID: Subject: Re: Review Request: API commands to add or remove NIC from a VM From: Marcus Sorensen To: cloudstack-dev@incubator.apache.org Content-Type: text/plain; charset=ISO-8859-1 X-Virus-Checked: Checked by ClamAV on apache.org On Mon, Jan 28, 2013 at 12:44 PM, Marcus Sorensen wrote: > On Mon, Jan 28, 2013 at 12:31 PM, Chiradeep Vittal > wrote: >> Inline >> >> On 1/25/13 4:03 PM, "Brian Angus" wrote: >> >>>My answers inline... >>> >>>On 01/24/2013 06:49 PM, Marcus Sorensen wrote: >>>> Brian is probably the right guy to answer some of these, but I'll chime >>>>in >>>> on a few. >>>> >>>> On Thu, Jan 24, 2013 at 5:58 PM, Chiradeep Vittal < >>>> Chiradeep.Vittal@citrix.com> wrote: >>>> >>>>> Thanks for this. >>>>> A few comments: >>>>> 1. These mutating operations have to be async and hence the API >>>>>commands >>>>> have to inherit from BaseAsyncCmd instead of BaseCmd >>>>> >>>> >>>> We can do that >>>This change has been pushed to the branch >> >> I saw this. We have 2 choices: extend from BaseAsyncCreateCmd or >> BaseAsyncCmd. I see that you have chosen the latter. If the former: you >> synchronously create the nic object first and obtain an id (uuid), and >> then asynchronously attach the nic. In the current implementation both the >> db entity creation and the attach are in the asynchronous workflow. If the >> attach fails in the backend, then there will be a nic object in the >> database theoretically attached to the vm, but not really attached. >> Choices: >> A) change the data model to model the fact that the nic is attached or not >> (and use the BaseAsyncCreateCmd approach) >> B) ensure that the nic is destroyed when the attach fails. >> >> Note that with (B) you have a potential race condition where an API client >> can retrieve the user vm before the attach has succeeded/failed and then >> potentially add a static nat rule. > > This is an interesting issue. If it fails to attach, we don't > necessarily want it to back out. In theory we only bother with the > attach because the VM is running, if it weren't then it would be > sufficient to just create the database entry. The existing code (e.g. > deployVirtualMachine) doesn't delete your nics if they fail to apply > when the VM starts up, correct? > >> >>>> >>>> >>>>> 2. The functional specification is rather sparse -- I.e., it is not of >>>>> much use for a QA person to develop a comprehensive set of test cases. >>>>>I >>>>> recommend taking a look at >>>>> >>>>>https://cwiki.apache.org/confluence/display/CLOUDSTACK/Design+Document+T >>>>>emp >>>>> late to see what more information one could put in there. For example, >>>>> edge cases (remove the last nic / exceed nic limit), configuration >>>>> parameters. >>>I have added more details to the functional spec. Probably still needs >>>more work... >> >> Yes, please add any information about limits, usage data, logs, failure >> conditions and any OS dependencies. >> >>>>> 3. There's todos in the code: either this code is important and cannot >>>>> ship without the TODO or is a nice to have. It is hard for someone to >>>>> figure out which case it is. Regardless if the TODOs are not fixed, >>>>>they >>>>> should have bugs >>>>> >>>> >>>> I'll have to revisit the code, but I know at least a few that were >>>> inherited from the existing code base (things that were borrowed). Also >>>>I >>>> know of one or two that were just notes to myself of things to >>>>investigate, >>>> I thought I had removed those but I'll double check. >>>Some of these have been addressed I'll work on the ones that still exist. >>>> >>>> >>>>> 4. Questions in my mind: >>>>> A. Do you update instance metadata during this operation? >>>>> B. Verify IP address is not already used / in range / etc >>>>> >>>> >>>> We don't assign the IP, we call the same code that creates nics in the >>>> first place, and the IP is assigned via the same functionality that >>>> currently exists. If there's a problem with using IPs that are already >>>> used, it's a problem for deployVirtualMachine as well. >>>> >> >> If so, then it is a bug for both use cases. Doesn't it mean we shouldn't >> fix at least one of the use cases? For example, if the network subnet is >> 10.1.1.0/24 and I try to create a nic with IP 75.75.75.75/32? I am >> referring to the 'ipaddress' parameter in the AddNicToVMCmd. > > We'll test this. My point was that our code will react the same way to > your example as deployVirtualMachine does when you give it a bogus IP. > If there's an issue there it's probably best handled as a bugfix in > the upstream functions than patching a special fix for this branch. This isn't currently an issue for either scenario. Bogus IP returns this: Insufficient capacity when adding NIC to VM[User|testVM]: com.cloud.exception.InsufficientVirtualNetworkCapcityException: Unable to acquire Guest IP address for network Ntwk[205|Guest|8]Scope=interface com.cloud.dc.DataCenter; id=1 > >> >> >>>> E. Verify live migration after operation? >>>>> >>>> >>>> This has been done for KVM, could be tested for Xen, but the VM >>>>definition >>>> looks good. >>>I have tested this on XEN devcloud as well >> >> I didn't know you could do live migration on devcloud? > > Why not? bring up another devcloud, add it as a second host. > >> >>>> >>>>> F. Operation blocks / fails during live migration / HA / start / stop >>>>> >>>> >>>> Brian recently changed some of the code related to this, but previously >>>>the >>>> VM state had to be 'stopped', not 'starting', 'stopping', 'running', or >>>> 'migrating'. >>>The API calls will fail if the VM is not in a Running or Stopped state. >> >> That closes some holes. But since the async operation does not have a >> state machine, it is still possible for someone to migrate / stop the VM >> while the nic attach is in progress. It may not matter (and might actually >> work). I would note this as a limitation (bug) and hope that Simon takes >> care of this in his feature. >> >>>> >>>> >>>>> G. Usage events for remove / add operations >>>>> >>>> >>>> I'm unfamilar with usage events. The usages for new nics are tracked, >>>> though. >>>> >>>> >>>>> H. Interaction of removeNic with extant LB / PF / Static NAT/ FW rules >>>>> >>>> >>>> This is a good one to verify, does removing a nic orphan a port >>>>forwarding >>>> rule, etc, or does removing the nic clean up the rule? >> >> Unlikely that any clean up is going to happen. > > There's actually a lot of cleanup related stuff in current master that > needs to be addressed, e.g. bridges don't go away when they're no > longer used, and VM names aren't removed from VR's hosts table when > VMs are removed (maybe this has been recently fixed? I'll look into > it). We'll get it taken care of if that bug exists. > >> >>>> >>>> >>>>> I. Interaction of addNic / updateNic with PF/Static NAT/Source NAT -- >>>>> e.g., the default route might remain the same inside the VM leading to >>>>> head scratching as to why a specific PF/NAT rule does not work. >>>>> >>>> >>>> This is a non-issue, in my opinion. It's similar to attaching a volume >>>>and >>>> then wondering why it's not formatted and mounted for us right where we >>>> wanted it. People have all sorts of varying uses for multiple networks, >>>>and >>>> it's up to them to configure what they intend. It would be good to make >>>>a >>>> note in the documentation though, people should know that we don't have >>>> much control over their VM's OS config and they need to treat it like >>>>any >>>> other server. >> >> The FS is a good place to document this. >> >> Thanks >> -- >> Chiradeep >> We appreciate your feedback Chiradeep, and helping us to make sure we account for all scenarios. Do you think we're far enough along on this to get it into master? We are just finishing up a few tests that will be added into the branch but I hope we're to they point that any lingering outliers can be bugfixes post code freeze.