cloudstack-users mailing list archives

From Andreas Huser <ahu...@7five-edv.de>
Subject Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
Date Wed, 24 Oct 2012 21:51:22 GMT
Hi Trevor, 

1.) That is not a good configuration. The ZIL (slog) must be a mirror: if that single 60GB SSD drive crashes, you can lose data! 

2.) The second bad idea is FreeBSD. FreeBSD is good, but it is not usable for a storage server. 

My recommendation: use enough RAM, at least 8GB, better 12GB or more. 
Install four hard drives and create two mirrors in the pool, 
e.g. zpool create tank mirror disk1 disk2 mirror disk3 disk4 
Then you have two vdevs, and ZFS stripes across all vdevs. 
With two vdevs this behaves like a RAID 10. 
With three vdevs the data is striped over three vdevs (more than a RAID 10), and so on. 
Every vdev brings more throughput, IOPS and performance. 
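
A minimal sketch of that layout with the standard zpool commands, using hypothetical Solaris-style device names (c1t0d0 ... c2t2d0) - adjust to your own disks:

  # two mirrored vdevs, striped together (RAID 10 behaviour)
  zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0

  # optional: a mirrored slog (per point 1 above) and an L2ARC cache device
  zpool add tank log mirror c2t0d0 c2t1d0
  zpool add tank cache c2t2d0

  # check layout and health
  zpool status tank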

If you don't want to buy Solaris, use OpenIndiana, but use version oi_151a3! That's important! 
Newer versions of OI use a newer ZFS pool version that is no longer compatible with stock Solaris or other ZFS systems. 
Solaris is at pool version 33, old OI and FreeBSD at 28, and newer OI reports 5000 (feature flags). You can upgrade the pool version but you cannot downgrade it. 

e.g. you buy new hardware with more slots for hard drives and want to use Solaris with Premier Support for a production environment. You cannot simply move the drives into the new machine and do a zpool import! That will not work, because the pool version is higher than the new system understands. 
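
A quick way to check this before moving drives, sketched with a hypothetical pool name:

  # on the old box: which pool version is the pool actually on?
  zpool get version tank

  # on the target box: which pool versions does its ZFS implementation support?
  zpool upgrade -v

  # on implementations that support it, a read-only import is a safer first test
  zpool import -o readonly=on tank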

I use OI 151a3 for a 40TB mirrored storage with GlusterFS (18TB usable). The machine runs stable; not one restart has been needed. 

Can you tell me your hardware and your plan for the storage connection and the cloud servers? Which environment are you planning? 


greetings from germany 
Andreas 



----- Original Message -----

From: "Trevor Francis" <trevor.francis@tgrahamcapital.com> 
To: cloudstack-users@incubator.apache.org 
Sent: Wednesday, 24 October 2012 22:24:04 
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT 

So it is looking more and more like I am going to be using ZFS on FreeBSD. 


I am limited to 4 drives in my storage server. I was thinking of putting in 2 x 1TB 10k RPM SATA drives with 64MB cache, plus a 60GB SSD for the ZIL and a 256GB SSD for L2ARC. Most of our VMs will have very small (1k) writes, but they will grow into the thousands per second. Any recommendations for or against this setup? I would put FreeBSD on a bootable flash drive. 
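
For reference, Trevor's proposed layout expressed as zpool commands on FreeBSD (device names ada0-ada3 are hypothetical); Andreas' objection above is that the single log SSD should itself be a mirror:

  zpool create tank mirror ada0 ada1     # 2 x 1TB data mirror
  zpool add tank log ada2                # single 60GB SSD slog (the point of contention)
  zpool add tank cache ada3              # 256GB SSD as L2ARC
  zpool status tank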


Also, unrelated: will XS 6.1 work with CS4? 










Trevor Francis 
Partner 
46 Labs | The PeerEdge Cloud 
http://www.46labs.com | http://www.peeredge.net 
405-362-0046 - Voice | 405-410-4980 - Cell 
trevorgfrancis - Skype 
trevor@46labs.com 

Solutions Provider for the Telecom Industry 




On Oct 24, 2012, at 1:20 PM, Clayton Weise <cweise@iswest.net> wrote: 





You're going to see a dip in performance with KVM, primarily because of the qcow2 format (KVM people, please correct me if I'm wrong, because I'm new to KVM). Due to time constraints I wasn't able to run any tests with writeback cache turned on or off, but I've seen people in forums claim both that having it on improves performance and that having it off does, so who knows. 

What I can tell you is that in my tests with our current setup I was able to achieve about ~4-5k IOPS for random reads/writes using a multitude of different benchmark utilities in both Linux and Windows with KVM’s raw disk format. When I switched over to qcow2 it went down to about 2-2.5k. With the way our test RAID system is constructed our theoretical max is about 5k IOPS so I’m not sure if the drop is something related to the overall potential (e.g. 50% of the total) or simply a hard cap of that number but I haven’t had a chance to really kick things around much. This test was done using an iSCSI LUN mounted from a Compellent SAN on 10gbit Ethernet and OCFS2 on top of it. I know there was a comment earlier saying that OCFS2 was old but what I can say about it is that it was extremely easy to configure and get up and running and seemed to be pretty solid when we tested it about a year ago as well as the other week. 
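
As a point of reference for the qcow2 vs. raw and cache-mode discussion, these are the sort of commands involved (a sketch; file names are made up, and in a CloudStack deployment the cache mode is set by libvirt in the domain XML rather than by hand):

  # convert a raw image to qcow2 for comparison testing
  qemu-img convert -f raw -O qcow2 vm01-disk.raw vm01-disk.qcow2

  # when testing qemu/KVM by hand, the cache mode is a per-drive option, e.g.
  #   -drive file=vm01-disk.qcow2,if=virtio,cache=writeback
  # check the cache= attribute in the <driver> element of the libvirt domain XML
  # to see what your hypervisor is actually using.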

Regarding your ZFS on Linux question: ZFS on Linux just isn't there yet. It's too new, and the work being done on it right now is primarily around stability and integrating features, not performance, so if you're going to do ZFS, do it on either BSD or Solaris/Illumos. The alternative is the hardware equivalent, something like LSI's CacheCade, which does basically the same thing and works just fine with Linux. 



From: Trevor Francis [mailto:trevor.francis@tgrahamcapital.com] 
Sent: Wednesday, October 24, 2012 8:27 AM 
To: cloudstack-users@incubator.apache.org 
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT 

So, since ZFS is super slow on CentOS, would it be better to use ext4 and bcache over the native Linux NFS daemon? 
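
For context, the ext4 + bcache combination Trevor is asking about would be assembled roughly like this (a sketch with hypothetical devices: /dev/sdb as the backing disk array, /dev/sdc as the caching SSD):

  # create the backing and cache devices, then attach the cache set by UUID
  make-bcache -B /dev/sdb
  make-bcache -C /dev/sdc
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

  # ext4 on top, exported over NFS as CloudStack primary storage
  mkfs.ext4 /dev/bcache0
  mount /dev/bcache0 /export/primary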



Also, I am reading that KVM is much more disk-performant with writeback caching. Does CloudStack deploy the KVM VM instances with this enabled by default? 








Trevor Francis 

Partner 

46 Labs | The PeerEdge Cloud 

http://www.46labs.com | http://www.peeredge.net 

405-362-0046 - Voice | 405-410-4980 - Cell 

trevorgfrancis - Skype 

trevor@46labs.com 



Solutions Provider for the Telecom Industry 






On Oct 24, 2012, at 8:50 AM, Nik Martin <nik.martin@nfinausa.com> wrote: 



On 10/24/2012 08:29 AM, Trevor Francis wrote: 


Seems my original post has taken a detour. Who knew storage could be so 
controversial! 

We are a CentOS house and it would be nice to have a standard OS across our platform, but many people have been singing ZFS' praises. To bridge that gap, evidently ZFS can be deployed on CentOS 6.3 using ZFS on Linux. 
Does anyone have experience with this? Would it be better to deploy CentOS with ZFS and a cache drive, or CentOS 6.3 ext4 with flashcache or bcache? 


I inquired about ZFS on Linux in the #cloudstack IRC channel, and the quote I got was "ZFS is slow on Linux". I then asked "Slow relative to what, ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response was "yes", so I'm being led to believe that it does not perform as well as other filesystems. For large file storage on Linux, which Xen VHDs are, I tend to use XFS if the network protocol is NFS. 
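
A bare-bones version of the XFS-over-NFS arrangement Nik describes, with made-up volume and export names:

  mkfs.xfs /dev/vg_primary/lv_primary
  mount /dev/vg_primary/lv_primary /export/primary

  # export to the hypervisor network, then reload the NFS exports
  echo '/export/primary 10.0.0.0/24(rw,async,no_root_squash)' >> /etc/exports
  exportfs -ra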



Also, is the standard Linux NFS daemon sufficient to handle a significant load, or do you recommend a different NFS daemon... or going iSCSI? 
Our biggest concern is that most of our VMs do many small sequential writes (log files). 

When I first started with CS on XenServer in January, I thought thin provisioning was important, so I built filesystem-based storage using NFS, with XFS as the underlying filesystem, on LVM. NFS (v3) on CentOS 6 became processor-bound VERY quickly, running 300% CPU load on a quad-core Xeon. 

I switched to iSCSI with LVM as the backing store and have had zero issues, except that I'm currently investigating lower-than-desired random read/write at < 4k block sizes. Using tests like Iozone in a VM, and running iostat -t 1 on the hypervisor console itself, I can get sustained 2500 IOPS on iSCSI, and it drops to 300 with < 4k blocks of data. I expect to resolve this soon though. 
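
If anyone wants to reproduce that kind of measurement, a hypothetical run with fio (as an alternative to Iozone) against a file on the mounted iSCSI/LVM volume, watched with iostat from the hypervisor, would look something like:

  # 4k random read/write, direct I/O, queue depth 32, 60 second run
  fio --name=randrw4k --filename=/mnt/primary/fio.test --size=4G \
      --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
      --runtime=60 --time_based --group_reporting

  # in another shell on the hypervisor: per-device statistics every second
  iostat -x 1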




Thoughts before I make a decision? 



Trevor Francis 
Partner 
46 Labs | The PeerEdge Cloud 
http://www.46labs.com | http://www.peeredge.net 
405-362-0046 - Voice | 405-410-4980 - Cell 
trevorgfrancis - Skype 
trevor@46labs.com 
Solutions Provider for the Telecom Industry 

On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fabrice.brazier@apalia.net> wrote: 



Hi, 

@bruce, it's pretty funny, because you bought 30 discs but only 2 LSI 9280-8i, and the 8i means 8 internal ports (so with 2 LSI controllers you can attach only 16 disks directly). 

Beyond that, a VelociRaptor is only a 10k disc, so we can count about 120 IOPS per disk. If you put them in a RAID 5 (by the way, everybody knows RAID 5 doesn't scale very well past 12 disks, the performance doesn't increase much, but never mind, you built your RAID 5 of 30 disks), you have a maximum of (30-1)*120 = 3480 IOPS for reads and ((30-1)*120)/5 = 700 IOPS for writes (in a 100% random profile). And you have only 32GB RAM, which is peanuts for caching your read IOPS. 

So the only way you get performance is your CacheCade. Now, CacheCade is made of SSD (which is funny, it's the same idea as ZIL and L2ARC in ZFS xD), and the maximum size of a CacheCade volume is 512GB per controller. You have 2 controllers, so 1TB of CacheCade at most. 

So you say your storage doesn't cost anything, but you don't talk about the size and the price of your CacheCade. And another thing: the day you have a real workload (not only 50 VMs...) and that workload is larger than 1TB, come back and explain your very large performance numbers to us. 

I too can build a ZFS server with 4 ZeusRAMs for the ZIL and 5 disks behind them, run a benchmark on 2GB files, and say "hey, look, I have 500,000 IOPS and 4GB of throughput"... 

Regards 

-----Original Message----- 
From: bruce.m@v365.com.au 
Sent: Wednesday, 24 October 2012 06:04 
To: cloudstack-users@incubator.apache.org 
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT 



These SANs we have built have saved us a fortune compared to FC block-I/O SANs from IBM, HP etc. 

I've been a contractor for IBM and HP for 20 years. I know how much they cost $$$$$ and what they can do. 

For a decent DS4000 you'll be looking at $80K, or $90K for StorageWorks. And if you're lucky you may get 100,000, perhaps 150,000 IOPS. 

What I've built is better, cheaper, and frankly, as I support Linux and SCST, much more reliable. 

The WD VR 1TB we get new for $310, so 30 x = $9,300. LSI 9280-8i x 2 = $1800 ($900 ea). The case $600, the mobo and 1 x Xeon 5650 $1000, 32GB ECC RAM $800, dual PSU $650 - about $14,000 for a much superior SAN. 

As I explained, another $15K for the Infiniband network, which does SRP and IP. 

As for your issues with cache misses, the CacheCade doesn't miss much in our tests. With WriteBack enabled, BBU enabled, and the SANs on our primary 22KVA UPS, it never fails. 

Also, it's not doing very much unless a user in the cloud is hitting sequential blocks or reading/writing the same data repeatedly for some weird reason. Even then we see little lag or slowdown when, as you say, it's hitting the disc. Infiniband with LSI just rocks even more than we had thought! 

The R/W pattern from groups of 50 VMs appears to be completely random: web servers, virtual firewalls, email servers, DNS, FTP, Windows domains, SharePoint, MSSQL, MySQL working on the same LUN, which are not considered critical (as stated by clients) - each VM getting over 300-1200MB/s constantly, even when other users are pulling or pushing a file copy at 250MB/s between VMs, etc. We've actually had to figure out how to slow down the SRP connections. If a user is provisioned with 2 LUNs from each SAN, in Windows you can copy between VM drives at 1220MB/s!!!!! faster than SSD; in Linux it's even faster, and that is on ESXi 4.1 with 15~20 servers per VM host. 

We have SNMP monitoring the 2 primary SANs, and they never go above 30% bandwidth. So the point is the cache has no real bearing on cloud I/O; cloud I/O is fairly low. We have LSI supplying us with the new PCIe 3 cards for a trial, so I expect even better performance. LONG STORY SHORT: GET LSI AND 16 x WD VRs and try it yourself! And yes, we compared IOPS against 5 x 500MB/s SSDs - it's SLOWER! MORE SPINDLES = MORE SPEED! 

The main reason we wanted fast I/O is for backups and snapshots, so as not to interfere with the daily operational workings of the cloud VM guests. Which it doesn't. SOOO many times with FC I've seen the attitude "yeah that will work", and some smart ass does a restore or backup in production hours, only to see the whole cluster slow down to a crawl. On iSCSI it's even more pathetic. 

With SRP on LSI "WE DONT HAVE THAT PROBLEM!" 

I did a snapshot restore for 2 machines yesterday for a client; it took 2 mins to restore 100GB. And none of the other servers were affected. SNMP latency on all VMs: no impact. Virtually 0 CPU cycles on the VMware hosts for the I/O move. I/O latency / storage path for other VMs, from SNMP data from the VM server, was 2ms, which is normal; the machine doing the copy crept up to 10ms but no more. 

So my conclusion, ON REAL WORLD OBSERVATIONS, IS.... INFINIBAND SRP IS BETTER THAN ANYTHING! 

Cache misses: we don't have a problem there at all. We're more worried about data rot, which is why we're interested in GFS on ZFS LUNs. But ZFS is still slow compared to what we have..... 

We are going to launch a new script which moves a file full of 16-bit random I/O, of 100MB and of 1TB; it will create a binary checksum for each file, and we can compare it over time to check for data rot. 

Just so you can appreciate the speed: we ran IOmeter from 3 VMs on 3 separate VM hosts. The 1st gets max I/O, i.e. up to 400,000 IOPS, pushing 2400MB/s. We started the 2nd while the 1st was running: same results across the 2 VMs. Then the 3rd concurrently: they all dropped to 2000MB/s and IOPS went down to 220,000 per host. So yes, there is a penalty. But the way Infiniband works, the switch balances the load across all servers. 4th server online, same test, and we get 150,000 IOPS on each server, all doing 1700~1800MB/s at 128K blocks, 50%/50% random/sequential. Changing the distribution to 100% random makes little difference. We confirmed all the drives were being accessed - all were. We even took a look at the camera in our DC to monitor the array LEDs... all flickering like mad! We took a look at the metrics of the MegaRAID, as we got it to work on CentOS: no errors, no issues or problems, no dropped reads or writes. 

We set the file on IOmeter as high as possible, 900000000 units. We then fired up 2 more hosts and did a real-world file copy; no real impact, still getting 2200MB/s+ read/write easily. So we stopped IOmeter and did 6 concurrent file copies between the 2 SANs, and each host blew us away at 2000MB/s read and write across all 6 hosts!! FXXK, we were amazed. 

Aside from that, LSI have done a super job on the cache modules on their cards. I don't see how ZFS will ever approach this kind of performance. 

Anyway, take from what I've said what you like. We have an awesome SAN... and it's CHEAP and amazingly fast! 

We're in the process of developing a public cloud, so when the web site is finished we will publish all of this in a whitepaper there. 

On 24.10.2012 10:57, Jason Davis wrote: 

Passionate are you? Take it easy, I am not knocking on what you are doing. 

The term "spinning rust" is a catch-all for hard drives, nothing personal at all; in fact the setup you have is impressive. 

Impressive until you have a cache miss. Cache misses are definitely an issue even in "capital E" enterprise storage. 

Big bad XIV disk arrays or 3par are impressive, fast arrays (crazy expensive ;]) when you are hitting DRAM or SSD, but if you are running cache-unfriendly workloads then ultimately your array is only as fast as the spinning disk sitting at the lowest tier. 

I agree that openstorage is the way to go; from a service provider's perspective it's hard being competitive with AWS or RackSpace if you are using expensive traditional SAN. 

So anyway, my question wasn't a poke at what you are doing, just curious about how big of an IOP/latency drop you see once a read or write has to actually see the spinning disk. :) 


On Oct 23, 2012 8:10 PM, wrote: 

oh well if your negative, and make rude comments then that's your problem, i don't care about you. 

If you don't believe that RDMA can perform this fast then again, your problem. Ask anyone with a decent 12+ disc RAID array and RDMA and they will tell you it will kick you in YOUR ballz! Anyone in storage knows to get IOPS and performance you need 12, preferably 16 spindles. 

the spinning rust as you call it, are the newest 200MB/s WD VelociRaptors with 64MB cache, @ 10KRPM 2.5" discs in a 32-bay storage chassis. we've used them in the past and they are as reliable as the SAS drives we use, but faster! 

RAID 10 IS BULLSHIT! Bloody slow and 50% goes up in smoke for nothing. All you get is the mirror speed of 2 drives, which is barely 400MB/s tops! I spit on RAID 10. 

We also tried RAID 60 and it was very good as well. But no point to use it as we replicate to a 2nd SAN. 

When we tested our new SANs, we failed 1 drive and set the rebuild rate to 30% on RAID 6, with 40% scan rate. The SAN only had 6TB of data on it, and it rebuilt in 3 hours. we removed a 2nd disc and replaced it with a blank, and it was rebuilt in another 4 hours. With no real impact to performance tests. 

Separating the underlying 6Gb/s SATA ports x 8 into effectively 16 x 3Gb/s channels using a SAS expander gives each disc 300MB/s of bandwidth. They can't physically perform better than 200~220MB/s, so there is enough bandwidth on the SATA and PCIE bus to cope. LSI rate the 9280-8i @ 2500MB/s but it does work faster with CacheCade, up to 3200MB/s in that test. 

So a real SAN has many spindles of high performance, and the WD VRs are better than some cheap SSD drives. 

RDMA is very fast, low CPU cycles, and reads and writes directly to RAM @ 40Gb/s. we created a RAM drive and tested it up to 3200MB/s, which is as fast as the PCIE 2 bus / LSI could handle. 

If you can't afford this relatively cheap equipment (compared to an IBM DS or HP StorageWorks SAN) then dont whinge at ME. You can't get fast performance. Any real cloud would have what we have built. None of our customers, ALL 300 of them! have lost a single bit of data, all VMs are running very quickly, and with no problems now for 3 months. 

So if you dont appreciate being shown how to build a kick ASS SAN then PISS OFF! 

On 24.10.2012 08:17, Jason Davis wrote: 

How often do you have folks cache missing and hitting the spinning rust below the cache cade tier? 

On Oct 23, 2012 6:25 PM, wrote: 


Hi 

with our SCST RDMA we have 40GB/s QDR Gen 2 with Mellanox Grid Director 4036 switches + HP Gen2 QDR chassis switches. 

Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller, depending on block size, for the IOmeter test. LSI 9280-8i with battery cache + CacheCade 2.0 x 2 in PCIE 16x slots on a Supermicro motherboard. 

1 x 250GB SSD for CacheCade + 16 x 1TB WD VR drives (16 - hot standby - 2 for RAID 6 parity) = 2600MB/s 

most of our tests are from 4K to 128K block size. @ 4K we get 295,000 IOPS @ 1590 MB/s, and @ 128K 350,000 IOPS @ 2400~2600MB/s. 

we have tuned 64K and 128K block size on different LUNs: 64K for database and 128K for general file. 

The best thing about Infiniband is low CPU cycles, only 5~6% during these tests. latency is as low as 4-6ms average read time. concurrent response times are from 5-12ms; even under heavy load it's below 20ms. 

Infiniband latency is below 0.01 us, which is why we chose it. 

we run MSSQL on the 64K-formatted LUNs, it's massively fast. If we copy the same data twice the CacheCade kicks in and we achieve even better speeds. 

Ive compared this to block I/O @ 8Gb/s Fibre Channel, and it barely gets 120,000 IOPS, at much higher latency. 

So for our money RDMA wins! 

Bruce 

On 23.10.2012 21:20, Jason Davis wrote: 

Out of curiosity, are there any quick performance numbers for these ZFS + GlusterFS mashups you guys are talking about? Specifically, IOPS and latency? Sequential read/write performance honestly isn't a very good benchmark to determine your SAN's performance. It's like comparing CPUs based solely on how many GHz they run at. Sure you can get great MB or GB/s with SATA disk, but I'd reckon that IOP performance is abysmal. If you are utilizing GlusterFS without the cache pooling magic that is ZFS then I would imagine that latency can be an issue. 

On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote: 
Hi Fabrice, 

i know, OpenSolaris/Solaris and Oracle, it's such a thing. I have been an open source user for more than 10 years, and I did not like this constellation with Oracle at the beginning. But Oracle does its work well, I know that. The cost is 700$ per socket and you can use as many TB as you want. And you can use the full Premier Support from Oracle. 

Nexenta develops with the Illumos code, and the licence is TB-based. That is not my favorite. As well, the pool version from Nexenta lags behind. The current Nexenta Infiniband is not a usable solution. But everyone can use what they want; everyone must decide for themselves. 

SRP targets or iSER are not difficult to configure. Use SRP for the storage unit connection. Solaris and GlusterFS build one storage unit. The GlusterFS server exports the final volume to the clients, i.e. KVM, VMware, Hyper-V etc. You can use native GlusterFS, RDMA, NFS or CIFS to export the volume. SRP has nothing to do with VMware. 

When you use a 7200 SAS drive the access time is the same as a SATA drive, only the quality of the hardware is better. When you need performance you must use SAS drives with 15000 rpm. But that is not needed when you install SSD for ZIL/L2ARC. ZeusRAM rocks :-) 

I use dedup only on secondary storage or on the backup server, not on primary storage. When you use SSD and SATA drives then you have a cheap and fast storage; a 1TB drive costs under 100$. Currently I don't need to save storage volume. 

Which application uses atime? I know "find -atime N". atime on a storage that stores only virtual disks? I don't need the information of when I last opened the disk :-) 

For a Solaris GUI take a look at napp-it: http://www.napp-it.org/ 

greetings from germany 
Andreas 

----- Original Message ----- 

From: "Fabrice Brazier" 
To: cloudstack-users@incubator.apache.org 
Sent: Tuesday, 23 October 2012 12:30:50 
Subject: RE: Primary Storage 
Hi Andreas, 

Hum, that's pretty cool. I know they still have trouble with infiniband on Nexenta, but it's clearly a priority on their roadmap (plus I trust the Nexenta team more than Oracle to evolve ZFS). 

I agree iSCSI over infiniband increases the latency, but most of the time it's just simpler to use IPoIB than native IB; for example, if you use VMware, the infiniband support is assured by Mellanox and not by VMware, so if you have an issue the VMware support probably won't help you anymore. 

About the raid, I'm a fan of raid 10: I prefer to build a raid 10 with 7200 SAS drives than multiple raidz/raidz2 vdevs of 15k SAS drives, particularly for a virtual environment with a random workload and multiple accesses. 

I'm a fan of NFS, so agreed about the ZIL, and ZeusRAM are the only ones for that (with a classical SSD you can hit the write hole). Agreed for compression too (but only lzjb level, gzip uses too much CPU). Disabling atime does decrease the IOPS load, but I'm not sure it is really a best practice. 

About deduplication I don't totally agree. It really depends on your array, the workload and the VM types on your cloudstack. Actually, when I build a zfs array I count 1GB of RAM per 1TB of disk; with deduplication I count 2GB of RAM per 1TB of disk (plus in Nexenta 4 they will add the possibility to use SSD for the deduplication metadata, like the L2ARC for the read cache). 

The last point is about your drives: why SATA drives? I mean SATA doesn't respect its latency, and the reliability of a SAS drive is 10x the SATA. Plus now you can find many NL-SAS drives at low cost. 

But it's really a nice architecture. I have never tried glusterfs (for the moment); plus, in that case it's really a good way to have a replacement for a metro-cluster for free. I tried one of its competitors (OneFS) and clearly clustered filesystems are the future. 

Cheers, 
Fabrice 

-----Original Message----- 
From: Andreas Huser [mailto:ahuser@7five-edv.de] 
Sent: Tuesday, 23 October 2012 11:37 
To: cloudstack-users@incubator.apache.org 
Subject: Re: Primary Storage 
Hi Fabrice, 

I don't know what other people do, but I have no problems with Infiniband + GlusterFS + Cloudstack. I don't use Nexenta; it's based on Illumos and does not work well with Infiniband. I have two different clusters in productive environments. 

The first: Solaris 11 with built-in GlusterFS 3.3, exporting the Gluster volume with RDMA. Performance is okay; you can use that for smaller environments. 

The second is a little bit more complex, with a GlusterFS server in the middle: 

ZFS Server: based on Solaris 11 
1.) Create a zpool with min. two vdevs and SSD read/write cache 
2.) Create a thin provisioned volume ("zfs create -V"), disable atime and enable compression (do not enable dedup!), and export it as an (iWARP) SRP target to the GlusterFS server, using a direct connection without an IB switch 

GlusterFS Server: 
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6) 
2.) Use the OFED driver from https://www.openfabrics.org 
3.) Import the SRP target from the ZFS server and format it as xfs 
4.) Create a Gluster volume: "volume create xy transport rdma" (use only rdma) 
5.) Connect the second IB port to an IB switch 

Cloudstack Hypervisor Node: 
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6) 
2.) Use the OFED driver from https://www.openfabrics.org 
3.) Import the Gluster volume 

ZFS Thin Volume ---- Infiniband SRP ----> GlusterFS ---- GFS Vol rdma ----> IB Switch ----> Clients 

The ZFS and GlusterFS servers form a storage unit, connected directly with 40Gbit Infiniband point-to-point. You do not feel that there is a cable in between! 

Important: when you have Infiniband, do not use IPoIB with iSCSI! If one already has Infiniband then one should also use its advantage: IPoIB has a higher latency than iWARP/SRP! 

SRP latency (usec): 

-- SRP -- 
local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000 
remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115 VAddr 0x000000088e6000 
------------------------------------------------------------------ 
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec] 
2       1000         1.29         125.30       1.31 
------------------------------------------------------------------ 

-- IPoIB --- 
[root@sv01sfogaa ~]# ping 10.200.0.10 
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data. 
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms 
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms 

When you put load on IPoIB the latency increases, and that is not good. 

That is my recommendation for a simple GlusterFS mirror: 

- Supermicro server with Intel hardware and expander backplane 
- 1x Crucial M4 SSD read cache 
- 2x ZeusIOPS SSD write cache (mirror) 
- SATA 24/7 hard drives 
- LSI HBA 9207 or 9211 
- ConnectX-2 QDR dual-port Infiniband adapter (HP refurbished with full warranty for 100$). Important > flash the newest firmware from Mellanox! 
- Mellanox IB switch 
- Solaris 11 
- GlusterFS 3.3 compiled with ib_verbs 
- Gluster volume transport only rdma 

Throughput constant up to 200 Mbyte/s >> more throughput with more storage servers or more hard drives on JBOD. 

Info: 
- I have some problems with Infiniband RDMA or SRP on OpenIndiana, Illumos or Nexenta. Some adapters have a high latency or no stable connection. Use Solaris, that's the right way! 
- OpenIndiana is beta! Infiniband ib_verbs does not work, or not well! 
- Use Solaris 11: the Infiniband ib_verbs are native and stable 
- Don't use Ubuntu client or server for Infiniband! Use RedHat, Fedora or CentOS and install the right drivers from https://www.openfabrics.org/downloads/OFED/ 
- You have no SSD cache? Disable sync on the ZFS volume! Important: you lose security for your data, but some protocols use sync flags in transport. For example NFS uses fsync by default, so the write cache is not active and NFS writes data directly to the hard drive. For data security and performance, give the storage server an SSD write cache. ZFS works by default with sync=standard, which prevents write holes (COW system). 

I hope that I could help a little 

Greeting from Germany 
Andreas 
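
A condensed sketch of the chain Andreas describes above, with hypothetical pool, brick, host and volume names; the ZFS/COMSTAR side is abbreviated, and the gluster commands follow the usual CLI shape for an rdma-transport volume:

  # ZFS server (Solaris 11): pool with two mirrored vdevs, thin-provisioned volume
  zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
  zfs create -s -V 2T tank/gfsvol
  zfs set compression=on tank/gfsvol
  # (export tank/gfsvol as an SRP target via COMSTAR over the direct IB link)

  # GlusterFS server (CentOS): format the imported SRP disk, build the rdma volume
  mkfs.xfs /dev/sdb                       # actual device name depends on the SRP initiator
  mount /dev/sdb /bricks/b1
  gluster volume create gfsvol transport rdma gfs01:/bricks/b1
  gluster volume start gfsvol

  # CloudStack/KVM node: mount the Gluster volume over RDMA
  mount -t glusterfs -o transport=rdma gfs01:/gfsvol /mnt/primary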



----- Original Message ----- 

From: "Fabrice Brazier" 
To: cloudstack-users@incubator.apache.org 
Sent: Tuesday, 23 October 2012 09:55:15 
Subject: RE: Primary Storage 
Hi Andreas, 

Hello, I just saw your configuration, it seems quite interesting. If I understand well, you want to build some ZFS arrays on the backend and export LUNs (probably by iSCSI over infiniband) to your Linux cluster, and on the Linux cluster you put GlusterFS. I can understand the point: with that you can have very good performance and reliability (zfs), plus scalability and redundancy (gluster), for a very low cost. 

So just one question: did you try the global namespace implementation from Nexenta? If yes, can you tell me what configuration is the best for you? I mean, the fact that you have a gluster cluster in the middle must impact the overall performance, no? 

Fabrice 

-----Original Message----- 
From: Andreas Huser [mailto:ahuser@7five-edv.de] 
Sent: Tuesday, 23 October 2012 05:40 
To: cloudstack-users@incubator.apache.org 
Subject: Re: Primary Storage 
Hi, 

for Cloudstack i 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

use 


<blockquote>

Solaris 11 ZFS + GlusterFS 


<blockquote>

over Infiniband (RDMA). 


<blockquote>


</blockquote>

</blockquote>

</blockquote>

That 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

</blockquote>

</blockquote>

gives the best performance and most 


<blockquote>

scalable 
</blockquote>

</blockquote>

Storage. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

I have 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

tasted some different solutions for 
</blockquote>

primary 


<blockquote>

<blockquote>

<blockquote>

Storage but the most 
</blockquote>

</blockquote>

are 


<blockquote>

<blockquote>

<blockquote>

to expensive and for 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

a CloudStack Cluster 


<blockquote>

<blockquote>

<blockquote>

not economic or 
</blockquote>

</blockquote>

have a poor 


<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

</blockquote>

</blockquote>

performance. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>


My Configuration: 
</blockquote>

</blockquote>


</blockquote>

</blockquote>


</blockquote>

Storage Node: 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

Supermicro Server (Intel Hardware) with Solaris 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

</blockquote>

11 


<blockquote>

with 


<blockquote>

SSD write and read 


<blockquote>

cache (read crucial-m4, 
</blockquote>

</blockquote>

</blockquote>

write ZeusIOPS) 


<blockquote>

GlusterFS 


<blockquote>

and dualport ConnectX 


<blockquote>


</blockquote>

</blockquote>

</blockquote>

40Gbit/s Infiniband adapter. 


<blockquote>

<blockquote>

<blockquote>





<blockquote>

I have 
</blockquote>

installed 
</blockquote>

</blockquote>

</blockquote>

GlusterFS direct on Solaris with a modified 


<blockquote>

code. 
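[Editor's sketch] As a minimal illustration of how such an SSD write/read cache is attached to a ZFS pool -- the pool name "tank" and the device names are assumptions, not taken from the setup above: 

  # Mirrored SLOG (write cache / ZIL) on two write-optimized SSDs, e.g. ZeusIOPS:
  zpool add tank log mirror c7t0d0 c7t1d0

  # L2ARC (read cache) on a consumer SSD, e.g. Crucial m4:
  zpool add tank cache c7t2d0

  # Verify the resulting layout:
  zpool status tank

The log vdev absorbs synchronous writes, while the cache vdev extends the ARC for reads. 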


If you build bigger systems for more than 50 VMs, it is better to split Solaris and GlusterFS, with a separate head node for GlusterFS. 


That looks like: 

Solaris ZFS backend storage with a dataset volume (thin provisioned) --> (SRP target attached directly, without an InfiniBand switch, to the GF node) --> GlusterFS node (the SRP target is formatted with an XFS filesystem and a GlusterFS volume is created on it) --> (InfiniBand over a Mellanox port switch) --> CloudStack node mounts the GlusterFS volume over RDMA. 
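[Editor's sketch] A minimal command-level sketch of that chain, assuming a pool named "tank", a GlusterFS node "gfnode1", and a brick path "/bricks/primary" (all illustrative names, not from the original post; the COMSTAR/SRP target configuration on Solaris is only indicated, not shown in full): 

  # Solaris storage node: thin-provisioned ZFS volume that will be exported as the LUN
  zfs create -s -V 4T tank/primary-lun
  # (expose tank/primary-lun as an SRP target via COMSTAR -- details omitted here)

  # GlusterFS node: the SRP-attached LUN appears as an ordinary SCSI disk
  mkfs.xfs /dev/sdb
  mkdir -p /bricks/primary
  mount /dev/sdb /bricks/primary
  mkdir -p /bricks/primary/brick1

  # Create and start the Gluster volume with RDMA transport
  gluster volume create primary transport rdma gfnode1:/bricks/primary/brick1
  gluster volume start primary

  # CloudStack node: mount the volume over RDMA
  # (the exact mount option or ".rdma" volume suffix varies by GlusterFS version)
  mount -t glusterfs -o transport=rdma gfnode1:/primary /mnt/primary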





For the dataset volume on the ZFS storage, disable atime and enable compression (space reclaim). With compression enabled you can shrink the ZFS volume again by zero-filling free space: on Linux with dd from /dev/zero, or inside a Windows VM with sdelete. That gives you back space on the primary storage for files deleted inside a VM, or for deleted VHDs and VMs in CloudStack. 
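[Editor's sketch] A minimal sketch of those dataset settings and the space-reclaim step -- the dataset names are assumptions, and the exact SDelete switch depends on the SDelete version: 

  # Solaris storage node: enable compression on the backing volume
  # (zeroed blocks then compress away, so the thin-provisioned volume shrinks)
  zfs set compression=on tank/primary-lun
  # atime=off applies when the backing store is a filesystem dataset rather than a zvol:
  zfs set atime=off tank/primary-fs

  # Inside a Linux VM (or on the XFS brick): zero-fill free space, then remove the file
  dd if=/dev/zero of=/zerofile bs=1M; sync; rm -f /zerofile

  # Inside a Windows VM: Sysinternals SDelete (-z zeroes free space in current versions)
  sdelete.exe -z C:

Once the freed blocks are zeroed, ZFS compression collapses them, which is what makes the space visible again on the thin-provisioned primary storage. 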


Greetings 
Andreas 
Kind regards 

Andreas Huser 
Managing Director 
System Engineer / Consultant (Cisco CSE, SMBAM, LCSE, ASAM) 
--------------------------------------- 
Zellerstraße 28 - 77654 Offenburg 
Tel: +49(781) 12786898 
Mobile: +49(176) 10308549 
ahuser@7five-edv.de 

----- Original Mail ----- 

From: "Outback Dingo" 
To: cloudstack-users@incubator.apache.org 
Sent: Tuesday, 23 October 2012 02:15:16 
Subject: Re: Primary Storage 

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote: 

>> Solaris 11 ZFS, and yes, we tried different setups: RAID levels, number of SSD cache devices, ZFS ARC options, etc. 
>> 
>> Cheers 
>> 
>> VMWare ?? 




-- 
Regards, 

Nik 

Nik Martin 
VP Business Development 
Nfina Technologies, Inc. 
+1.251.243.0043 x1003 
Relentless Reliability 
