Date: Tue, 12 Aug 2014 11:07:12 +0000 (UTC)
From: "Joris van Lieshout (JIRA)"
To: cloudstack-issues@incubator.apache.org
Reply-To: dev@cloudstack.apache.org
Subject: [jira] [Updated] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris van Lieshout updated CLOUDSTACK-7319:
-------------------------------------------

    Description:
We noticed that the dd process was far too aggressive on Dom0, causing all kinds of problems on a XenServer with medium workloads. ACS uses the dd command to copy incremental snapshots to secondary storage.
This process is too heavy on Dom0 resources, impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. We've found that this is because the Dom0 kernel caches the read and write operations of dd.

Some of the issues we have seen as a consequence of this are:
- DomU performance degradation and freezes
- OVS freezing and not forwarding any traffic
  - including LACPDUs, resulting in the bond going down
- keepalived heartbeat packets between RRVMs not being sent/received, resulting in a flapping RRVM master state
- broken snapshot copy processes
- the XenServer heartbeat script reaching its timeout and fencing the server
- poolmaster connection loss
- ACS marking the host as down and fencing the instances even though they are still running on the original host, resulting in the same instance running on two hosts in one cluster
- VHD corruption as a result of some of the issues mentioned above

We've developed a patch for the XenServer script /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both the input and output files (iflag=direct oflag=direct). Our tests have shown that Dom0 load during snapshot copy is much lower.

was:
We noticed that the dd process was far too aggressive on Dom0, causing all kinds of problems on a XenServer with medium workloads. ACS uses the dd command to copy incremental snapshots to secondary storage. This process is too heavy on Dom0 resources, impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. We've found that this is because the Dom0 kernel caches the read and write operations of dd. We've developed a patch for the XenServer script /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both the input and output files. Our tests have shown that Dom0 load during snapshot copy is much lower. I will upload the patch on review.
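The change described above boils down to appending iflag=direct and oflag=direct to the dd invocation the plugin builds, so dd opens both files with O_DIRECT and its I/O bypasses the Dom0 page cache instead of evicting pages that running domains depend on. A minimal Python sketch of the idea (the function name, paths, and block size are illustrative; the real vmopsSnapshot plugin derives them from the snapshot VHD chain):

```python
def build_dd_command(src, dst, use_direct=True, bs="2M"):
    """Build the dd argument list for copying a snapshot VHD.

    Hypothetical sketch of the patched behaviour: with use_direct=True,
    iflag=direct/oflag=direct make dd open both files with O_DIRECT,
    keeping its reads and writes out of the Dom0 page cache.
    """
    cmd = ["dd", "if=" + src, "of=" + dst, "bs=" + bs]
    if use_direct:
        # The two flags added by the patch described in this issue.
        cmd += ["iflag=direct", "oflag=direct"]
    return cmd

# Before the patch (cache-polluting):
#   dd if=/path/src.vhd of=/path/dst.vhd bs=2M
# After the patch (cache-bypassing):
#   dd if=/path/src.vhd of=/path/dst.vhd bs=2M iflag=direct oflag=direct
```

Note that O_DIRECT generally requires I/O sizes aligned to the device sector size, so the block size passed to dd has to stay aligned; a large power-of-two value such as 2M satisfies that.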
> Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots
> -------------------------------------------------------------------------------------------------------
>
>                  Key: CLOUDSTACK-7319
>                  URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319
>              Project: CloudStack
>           Issue Type: Bug
>       Security Level: Public (Anyone can view this level - this is the default.)
>           Components: Snapshot, XenServer
>     Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1
>             Reporter: Joris van Lieshout
>             Priority: Critical
>
> We noticed that the dd process was far too aggressive on Dom0, causing all kinds of problems on a XenServer with medium workloads.
> ACS uses the dd command to copy incremental snapshots to secondary storage. This process is too heavy on Dom0 resources, impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. We've found that this is because the Dom0 kernel caches the read and write operations of dd.
> Some of the issues we have seen as a consequence of this are:
> - DomU performance degradation and freezes
> - OVS freezing and not forwarding any traffic
>   - including LACPDUs, resulting in the bond going down
> - keepalived heartbeat packets between RRVMs not being sent/received, resulting in a flapping RRVM master state
> - broken snapshot copy processes
> - the XenServer heartbeat script reaching its timeout and fencing the server
> - poolmaster connection loss
> - ACS marking the host as down and fencing the instances even though they are still running on the original host, resulting in the same instance running on two hosts in one cluster
> - VHD corruption as a result of some of the issues mentioned above
> We've developed a patch for the XenServer script /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both the input and output files (iflag=direct oflag=direct).
> Our tests have shown that Dom0 load during snapshot copy is much lower.

--
This message was sent by Atlassian JIRA
(v6.2#6252)