cloudstack-users mailing list archives

From Nik Martin <nik.mar...@nfinausa.com>
Subject Re: Storage failure is not handled well in CS
Date Wed, 03 Oct 2012 18:24:39 GMT
On 10/03/2012 12:03 PM, Ahmad Emneina wrote:
> Hey Nik,
>
> It appears the compute host, or cluster, can't connect to the SAN
> referenced below. Have you peered into the compute hosts' logs? They
> should be more informative as to why it can't connect to the storage.
> You should also have at least one storage pool up to be able to
> provision against.
>

Ahmad, if you reference my original post to the list, I have two SANs, 
both used as primary storage.  One is HD based and one is SSD based; I 
use the storage tags "HD" and "SSD" respectively.  The HD-based SAN is 
a single 20TB volume with 1 iSCSI target and 1 LUN.  The SSD SAN is 
two 5TB volumes, each with 1 target and 1 LUN, in an Active-Active 
configuration.  The SSD SAN suffered from a misconfiguration issue, so 
we had to put it into maintenance mode in a hurry and shut it down.  I 
fully expected the volumes and VMs provisioned on the SSD SAN to be 
unavailable.  The problem is, CloudStack continued to try to access 
volume id 204, which is target 0 on the SSD SAN.  It shut every VM 
down, put all hypervisors into Alert state, and went into a loop 
trying to connect to a volume that is in maintenance mode.  This 
creates a very bad situation for me and my customers.

Regards,

Nik

> On 10/3/12 6:51 AM, "Nik Martin" <nik.martin@nfinausa.com> wrote:
>
>> Bump?  This is a serious issue that I need to get resolved.  An entire
>> cloud going down while one SAN is being repaired is a bad thing.  My
>> cloud controller still refuses to start VMs because it cannot connect to
>> a SAN that is in maintenance mode and is offline.
>>
>>
>> On 10/02/2012 03:12 PM, Nik Martin wrote:
>>> I have two SANs connected to CS as primary storage.  One is an HD-based
>>> SAN, with a single target and LUN, and the other is an SSD SAN split
>>> into two volumes, each connected with a target and LUN.  The HD SAN is
>>> where all system VMs are stored (or they were before I added the HD SAN,
>>> but I have no idea where the system VM volumes are stored).  This
>>> morning, I had to do a semi-emergency shutdown of the SSD SAN, so I put
>>> both LUNs in emergency maintenance mode in CS.  CS shut down the entire
>>> cloud, not just the volumes stored on the SSD SAN.  The SAN is offline,
>>> and CS shows it in maintenance mode, but no VMs will start, and the CS
>>> management log shows:
>>>
>>> onnecting; event = AgentDisconnected; new status = Alert; old update
>>> count = 959; new update count = 960]
>>> 2012-10-02 15:10:40,370 DEBUG [agent.manager.ClusteredAgentManagerImpl]
>>> (AgentTaskPool-2:null) Notifying other nodes of to disconnect
>>> 2012-10-02 15:10:40,370 WARN  [cloud.resource.ResourceManagerImpl]
>>> (AgentTaskPool-2:null) Unable to connect due to
>>> com.cloud.exception.ConnectionException: Unable to connect to pool
>>> Pool[204|IscsiLUN]
>>>       at
>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>       at java.lang.Thread.run(Thread.java:679)
>>> Caused by: com.cloud.exception.StorageUnavailableException: Resource
>>> [StoragePool:204] is unreachable: Unable establish connection from
>>> storage head to storage pool 204 due to ModifyStoragePoolCommand add
>>> XenAPIException:Can not see storage pool:
>>> cfd3b016-d4d9-3bb9-b1f9-f31374c44185 from on
>>> host:82cad07f-6fbc-464e-86fe-28bb4af4bbcd pool:
>>> 172.16.10.15/iqn.2012-01:com.nfinausa.san2:mirror0/0
>>>       at com.cloud.storage.StorageManagerImpl.connectHostToSharedPool(StorageManagerImpl.java:1567)
>>>       at com.cloud.storage.listener.StoragePoolMonitor.processConnect(StoragePoolMonitor.java:88)
>>>
>>>       ... 8 more
>>> 2012-10-02 15:10:40,371 DEBUG [cloud.host.Status] (AgentTaskPool-2:null)
>>> Transition:[Resource state = Enabled, Agent event = AgentDisconnected,
>>> Host id = 6, name = hv1]
>>> 2012-10-02 15:10:40,375 DEBUG [cloud.host.Status] (AgentTaskPool-2:null)
>>> Agent status update: [id = 6; name = hv1; old status = Alert; event =
>>> AgentDisconnected; new status = Alert; old update count = 960; new
>>> update count = 961]
>>>
>>>
>>> host:82cad07f-6fbc-464e-86fe-28bb4af4bbcd pool:
>>> 172.16.10.15/iqn.2012-01:com.nfinausa.san2:mirror0/1 is the SAN that is
>>> in maintenance mode, so why is CS still trying to connect?  All my HVs
>>> are in alert state because of this.
>>>
>>
>>
>> --
>> Regards,
>>
>> Nik
>>
>> Nik Martin
>> VP Business Development
>> Nfina Technologies, Inc.
>> +1.251.243.0043 x1003
>> Relentless Reliability
>>
>
>
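
Ahmad, on your point about peering into the compute hosts: besides the 
agent logs, this is the quick check I run from the XenServer pool 
master to see which SRs the hosts can actually reach.  Again just a 
minimal sketch: it uses the XenAPI Python bindings that ship with 
XenServer, and the host address and root credentials are placeholders. 
It lists every SR along with whether its PBDs (the per-host plugs of 
that SR) are currently attached, which lines up with the "Can not see 
storage pool" XenAPIException in the management log above.

import XenAPI  # XenAPI Python bindings shipped with XenServer/XCP

# Placeholders: point this at the pool master with real credentials.
session = XenAPI.Session("https://hv1.example.com")
session.xenapi.login_with_password("root", "password")
try:
    for sr_ref, sr in session.xenapi.SR.get_all_records().items():
        # A PBD is the per-host connection of an SR; if none of them is
        # attached, that host cannot see the storage pool.
        attached = [session.xenapi.PBD.get_currently_attached(pbd)
                    for pbd in sr["PBDs"]]
        print(sr["uuid"], sr["type"], sr["name_label"], attached)
finally:
    session.xenapi.session.logout()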


-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability
