Subject: Re: Review Request 24598: Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots
From: "Joris van Lieshout"
To: "anthony xu", "Hugo Trippaers", "Alex Huang", "daan Hoogland", "edison su", "Kishan Kavala", "Min Chen", "Sanjay Tripathi"
Cc: "Joris van Lieshout", "cloudstack"
Date: Tue, 12 Aug 2014 11:21:50 -0000
Message-ID: <20140812112150.10843.68645@reviews.apache.org>
In-Reply-To: <20140812110553.10844.9283@reviews.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/24598/
X-ReviewRequest-Repository: cloudstack-git

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24598/
-----------------------------------------------------------

(Updated Aug. 12, 2014, 11:21 a.m.)


Review request for cloudstack, Alex Huang, anthony xu, daan Hoogland, edison su, Kishan Kavala, Min Chen, Sanjay Tripathi, and Hugo Trippaers.


Bugs: CLOUDSTACK-7319
    https://issues.apache.org/jira/browse/CLOUDSTACK-7319


Repository: cloudstack-git


Description (updated)
-------

We noticed that the dd process was far too aggressive on Dom0, causing all kinds of problems on a XenServer host with medium workloads.

ACS uses the dd command to copy incremental snapshots to secondary storage. This process is too heavy on Dom0 resources, impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. We've found that this is because the Dom0 kernel caches the read and write operations of dd.
Some of the issues we have seen as a consequence of this are:
- DomU performance degradation and freezes
- OVS freezing and not forwarding any traffic
  - including LACPDUs, resulting in the bond going down
- keepalived heartbeat packets between RRVMs not being sent/received, resulting in flapping RRVM master state
- snapshot copy processes breaking
- the XenServer heartbeat script reaching its timeout and fencing the server
- pool master connection loss
- ACS marking the host as down and fencing the instances even though they are still running on the original host, resulting in the same instance running on two hosts in one cluster
- VHD corruption as a result of some of the issues mentioned above

We've developed a patch for the XenServer script /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag for both the input and output files (iflag=direct oflag=direct).

Our tests have shown that Dom0 load during snapshot copy is much lower. We believe Hotfix 4 for XS62 SP1 contains a similar fix, but for the sparse dd process used for the first copy of a chain.

http://support.citrix.com/article/CTX140417

== begin quote ==
Copying a virtual disk between SRs uses the unbuffered I/O to avoid polluting the pagecache in the Control Domain (dom0). This reduces the dom0 vCPU overhead and allows the pagecache to work more effectively for other operations.
== end quote ==


Diffs
-----

  scripts/vm/hypervisor/xenserver/vmopsSnapshot 5fd69a6

Diff: https://reviews.apache.org/r/24598/diff/


Testing
-------

We are running this fix in our beta and prod environments (both using ACS 4.3.0) with great success.


Thanks,

Joris van Lieshout
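
For readers who want to see the shape of the change described above, here is a minimal Python sketch. It is not the actual vmopsSnapshot code. The helper name build_dd_command, the 2M block size, and the example paths are assumptions made for illustration only. The only parts taken from the description are the two GNU dd flags, iflag=direct and oflag=direct, which ask dd to use direct I/O for the input and output files instead of going through the Dom0 page cache.

```python
def build_dd_command(source, destination, block_size="2M", direct=True):
    """Return the dd argument list for copying source to destination.

    Hypothetical helper for illustration; the block size is an assumed
    value, not one taken from the patch.
    """
    cmd = ["dd", "if=" + source, "of=" + destination, "bs=" + block_size]
    if direct:
        # Direct I/O on both sides, as described in the review request,
        # so the copy does not fill the Dom0 page cache.
        cmd += ["iflag=direct", "oflag=direct"]
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders, not real ACS or XenServer paths.
    print(" ".join(build_dd_command("/path/to/source.vhd", "/path/to/copy.vhd")))
```

Building the argument list as a Python list (rather than one shell string) keeps the flags easy to toggle and avoids shell quoting issues if the list is later passed to subprocess.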