From: Clayton Weise <cweise@iswest.net>
To: cloudstack-users@incubator.apache.org
Subject: RE: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
Date: Wed, 24 Oct 2012 18:20:10 +0000

You're going to see a dip in performance with KVM, primarily because of the qcow2 format (KVM people, please correct me if I'm wrong, because I'm new to KVM). Due to time constraints I wasn't able to run any tests with writeback cache turned on or off, but I've seen people in forums claim both that having it on improves performance and that having it off does, so who knows.
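For anyone who wants to check what the hypervisor is actually doing, here is a minimal sketch (assuming libvirt/KVM; the guest name and image paths are just examples) of how to inspect the disk cache mode and set up a raw-vs-qcow2 comparison:

    # Look at the <driver> element for each disk of a running guest; the
    # cache mode (none/writeback/writethrough) appears as cache='...'.
    virsh dumpxml i-2-10-VM | grep -A 2 "<disk"

    # Create a qcow2 image and a raw image of the same size so the two
    # formats can be benchmarked against the same backing storage.
    qemu-img create -f qcow2 /var/lib/libvirt/images/bench.qcow2 20G
    qemu-img create -f raw   /var/lib/libvirt/images/bench.raw   20G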
What I can tell you is that in my tests with our current setup I was able to achieve about 4-5k IOPS for random reads/writes, using a multitude of different benchmark utilities in both Linux and Windows, with KVM's raw disk format. When I switched over to qcow2 it went down to about 2-2.5k. With the way our test RAID system is constructed, our theoretical max is about 5k IOPS, so I'm not sure if the drop is proportional to the overall potential (e.g. 50% of the total) or simply a hard cap at that number; I haven't had a chance to really kick things around much. This test was done using an iSCSI LUN mounted from a Compellent SAN over 10Gbit Ethernet, with OCFS2 on top of it. I know there was a comment earlier saying that OCFS2 was old, but what I can say about it is that it was extremely easy to configure and get up and running, and it seemed to be pretty solid when we tested it about a year ago as well as the other week.

Regarding your ZFS on Linux question: ZFS on Linux just isn't there yet. It's too new, and the work being done on it right now is primarily around stability and integrating features, not performance, so if you're going to do ZFS, do it on either BSD or Solaris/Illumos. The alternative is the hardware equivalent, something like LSI's CacheCade, which does basically the same thing and works just fine with Linux.

From: Trevor Francis [mailto:trevor.francis@tgrahamcapital.com]
Sent: Wednesday, October 24, 2012 8:27 AM
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

So, since ZFS is super slow on CentOS, would it be better to use ext4 and bcache over the native Linux NFS daemon?

Also, I am reading that KVM is much more disk-performant using writeback caching. Does CloudStack deploy the KVM VM instances with this enabled by default?

Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
Solutions Provider for the Telecom Industry

On Oct 24, 2012, at 8:50 AM, Nik Martin <nik.martin@nfinausa.com> wrote:

On 10/24/2012 08:29 AM, Trevor Francis wrote:

Seems my original post has taken a detour. Who knew storage could be so controversial!

We are a CentOS house and it would be nice to have a standard OS across our platform, but many people have been singing ZFS' praises. To bridge that gap, evidently ZFS can be deployed on CentOS 6.3 using ZFS on Linux. Does anyone have experience with this? Would it be better to deploy CentOS with ZFS and a cache drive, or CentOS 6.3 ext4 with flashcache or bcache?

I inquired about ZFS on Linux in the #cloudstack IRC channel, and the quote I got was "ZFS is slow on Linux". I then asked "Slow relative to what, ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response was "yes", so I'm being led to believe that it does not perform as well as other filesystems. For large file storage on Linux, which Xen VHDs are, I tend to use XFS if the network protocol is NFS.

Also, is the standard Linux NFS daemon sufficient to handle a significant load, or do you recommend a different NFS daemon... or going iSCSI? Our biggest concern is that we have most VMs doing many sequential small writes (log files).
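For what it's worth, a minimal sketch of how one could approximate that log-file workload when comparing NFS-backed against iSCSI-backed primary storage (assuming fio is installed; the mount point is just an example path):

    # Many small sequential writes with an fsync after each one, roughly what
    # a busy log file does; run it once on each storage backend and compare.
    fio --name=logwrite --directory=/mnt/primary --rw=write --bs=4k \
        --size=256m --numjobs=4 --fsync=1 --group_reporting

    # Meanwhile, watch per-device IOPS and latency from the host.
    iostat -x 1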
When I first started with CS on XenServer in January, I thought thin provisioning was important, so I built filesystem-based storage using NFS, with XFS as the underlying filesystem, on LVM. NFS (v3) on CentOS 6 became processor bound VERY quickly, running 300% CPU load on a quad-core Xeon.

I switched to iSCSI with LVM as the backing store and have had zero issues, except that I'm currently investigating lower-than-desired random read/write at < 4k block sizes. Using tests like Iozone in a VM, and running iostat -t 1 on the hypervisor console itself, I can get a sustained 2500 IOPS on iSCSI, and it drops to 300 with < 4k blocks of data. I expect to resolve this soon though.

Thoughts before I make a decision?

Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
Solutions Provider for the Telecom Industry

On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fabrice.brazier@apalia.net> wrote:

Hi,

@bruce, it's pretty funny, because you bought 30 discs but only 2 LSI 9280-8i controllers; the 8i means 8 internal ports, so with 2 LSIs you can attach only 16 disks.

After that, a VelociRaptor is only a 10k disc, so we can count about 120 IOPS per disk. If we put them in a RAID 5 (by the way, everybody knows RAID 5 doesn't scale very well past 12 disks; the performance doesn't increase much, but never mind, you built your RAID 5 of 30 disks), you have a maximum of (30-1)*120 = 3480 IOPS for reads and ((30-1)*120)/5 = ~700 IOPS for writes (in a 100% random profile). And you have only 32GB of RAM, which is peanuts for caching your read IOPS.

So the only way you get performance is your CacheCade. Now, a CacheCade is made of SSD (weird, actually it's the same idea as the ZIL and L2ARC in ZFS xD), and the maximum size of a CacheCade is 512GB per controller. You have 2 controllers, so 1TB of CacheCade max.

So you say your storage doesn't cost anything, but you don't talk about the size and the price of your CacheCade. And another thing: the day you have a real workload (not only 50 VMs...) and this workload is larger than 1TB, come back and explain your very large performance to us.

Me too, I can build a ZFS server with 4 ZeusRAMs in the ZIL and 5 disks behind it, run a benchmark on 2GB files, and say "hey, see, I have 500,000 IOPS and 4GB/s of throughput"...

Regards

-----Original Message-----
From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
Sent: Wednesday, 24 October 2012 06:04
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

These SANs we have built have saved us a fortune compared to FC block-I/O SANs from IBM, HP etc. I've been a contractor for IBM and HP for 20 years. I know how much they cost and what they can do. For a decent DS4000 you'll be looking at $80K, or $90K for StorageWorks, and if you're lucky you may get 100,000, perhaps 150,000 IOPS.

What I've built is better, cheaper, and frankly, as I support Linux and SCST, much more reliable. The WD VR 1TB we get new for $310, so 30 x = $9,300; LSI 9280-8i x 2, $1800 ($900 ea); the case $600; the mobo and 1 x Xeon 5650 $1000; 32GB ECC RAM $800; dual PSU $650 - about $14,000 for a much superior SAN. As I explained, another $15K for the Infiniband network, which does SRP and IP.

As for your issues with cache misses, the CacheCade doesn't miss much in our tests, with WriteBack enabled, BBU enabled, and the SANs on our primary 22KVA UPS. It never fails.
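(For anyone who wants to check the same settings on their own controller, a minimal sketch using LSI's MegaCli utility; the MegaCli64 binary name is the usual one on 64-bit Linux, and exact option spelling can vary between MegaCli releases:)

    # Show the current cache policy (WriteBack/WriteThrough, read-ahead, etc.)
    # for all logical drives on all adapters.
    MegaCli64 -LDGetProp -Cache -LAll -aAll

    # Show BBU state, so you know WriteBack is actually safe to leave enabled.
    MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll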
Also, it's not doing very much unless a user in the cloud is hitting sequential blocks or reading/writing the same data repeatedly for some weird reason. Even then we see little lag or slowdown when, as you say, it's hitting the disc. Infiniband with LSI just rocks, even more than we had thought!

The R/W pattern from groups of 50 VMs appears to be completely random: web servers, virtual firewalls, email servers, DNS, FTP, Windows domains, SharePoint, MSSQL, MySQL working on the same LUN (which are not considered critical, as stated by clients), with each VM getting over 300-1200MB/s constantly, even when other users are pulling or pushing a file copy at 250Mb/s between VMs, etc. We've actually had to figure out how to slow down the SRP connections. If a user is provisioned with 2 LUNs from each SAN, in Windows you can copy between VM drives at 1220MB/s!!!!! Faster than SSD. In Linux it's even faster, and that is on ESXi 4.1 with 15~20 servers per VM host.

We have SNMP monitoring on the 2 primary SANs, and they never go above 30% bandwidth. So the point is the cache has no real bearing on cloud I/O; cloud I/O is fairly low. We have LSI supplying us with the new PCIe 3 cards for a trial, so I expect even better performance. LONG STORY SHORT: get LSI and 16 x WD VRs and try it yourself! And yes, we compared IOPS against 5 x 500MB/s SSDs - it's SLOWER! MORE SPINDLES = MORE SPEED!

The main reason we wanted fast I/O is for backups and snapshots, so as not to interfere with the daily operational workings of the cloud VM guests. Which it doesn't. SOOO many TIMES with FC I've seen the attitude "yeah, that will work", and some smart ass does a restore or backup in prod hours, only to see the whole cluster slow to a crawl. On iSCSI it's even more pathetic. With SRP on LSI, "WE DON'T HAVE THAT PROBLEM!" I did a snapshot restore of 2 machines yesterday for a client; it took 2 minutes to restore 100GB, and none of the other servers were affected. SNMP latency on all VMs: no impact. Virtually 0 CPU cycles on the VMware hosts for the I/O move. I/O latency on the storage path for other VMs, from SNMP data on the VM server, was 2ms, which is normal; the machine doing the copy crept up to 10ms but no more.

So my conclusion, FROM REAL WORLD OBSERVATIONS, is... INFINIBAND SRP IS BETTER THAN ANYTHING! Cache misses are not a problem for us at all. We're more worried about data rot, which is why we're interested in GFS on ZFS LUNs. But ZFS is still slow compared to what we have.....

We are going to launch a new script which writes files full of 16-bit random I/O, 100MB and 1TB, creates a binary checksum for each file, and lets us compare the checksums over time to check for data rot.

Just so you can appreciate the speed: we ran IOMeter from 3 VMs on 3 separate VM hosts. The 1st gets max I/O, i.e. up to 400,000 IOPS, pushing 2400MB/s. We started the 2nd while the 1st was running: same results across the 2 VMs. Then the 3rd concurrently: they all dropped to 2000MB/s and IOPS went down to 220,000 per host, so yes, there is a penalty. But the way Infiniband works is that the switch balances the load across all servers. 4th server online, same test, and we get 150,000 IOPS on each server, all doing 1700~1800MB/s at 128K blocks, 50%/50% random/sequential. Changing the distribution to 100% random makes little difference. We confirmed all the drives were being accessed - all were. We even took a look at the camera out at the DC to monitor the array LEDs... all flickering like mad!
We took a look at the metrics of the MegaRAID once we got it working on CentOS: no errors, no issues or problems, no dropped reads or writes.

We set the file on IOMeter as high as possible, 900000000 units. We then fired up 2 more hosts and did a real-world file copy: no real impact, still getting 2200MB/s+ read/write easily. So we stopped IOMeter and did 6 concurrent file copies between the 2 SANs, and each host blew us away at 2000MB/s read and write across all 6 hosts!! FXXK, we were amazed.

Aside from that, LSI have done a super job on the cache modules on their cards. I don't see how ZFS will ever approach this kind of performance. Anyway, take from what I've said what you like. We have an awesome SAN... and it's CHEAP and amazingly fast! We're in the process of developing a public cloud, so when the web site is finished we will publish all of this in a whitepaper there.

On 24.10.2012 10:57, Jason Davis wrote:

Passionate, are you? Take it easy, I am not knocking what you are doing. The term "spinning rust" is a catch-all for hard drives, nothing personal at all; in fact the setup you have is impressive.

Impressive until you have a cache miss. Cache misses are definitely an issue even in "capital E" enterprise storage. Big bad XIV disk arrays or 3PAR are impressive, fast arrays (crazy expensive ;]) when you are hitting DRAM or SSD, but if you are running cache-unfriendly workloads then ultimately your array is only as fast as the spinning disk sitting at the lowest tier.

I agree that open storage is the way to go; from a service provider's perspective it's hard being competitive with AWS or RackSpace if you are using an expensive traditional SAN.

So anyway, my question wasn't a poke at what you are doing, just curious about how big of an IOPS/latency drop you see once a read or write has to actually hit the spinning disk. :)

On Oct 23, 2012 8:10 PM, wrote:

Oh well, if you're negative and make rude comments then that's your problem; I don't care about you. If you don't believe that RDMA can perform this fast then, again, your problem. Ask anyone with a decent 12+ disc RAID array and RDMA and they will tell you it will kick you in YOUR ballz! Anyone in storage knows that to get IOPS and performance you need 12, preferably 16, spindles.

The spinning rust, as you call it, are the newest 200MB/s WD VelociRaptors with 64MB cache, 10K RPM 2.5" discs in a 32-bay storage chassis. We've used them in the past and they are as reliable as the SAS drives we use, but faster!

RAID 10 IS BULLSHIT! Bloody slow, and 50% goes up in smoke for nothing. All you get is the mirror speed of 2 drives, which is barely 400MB/s tops! I spit on RAID 10. We also tried RAID 60 and it was very good as well, but there's no point using it as we replicate to a 2nd SAN.

When we tested our new SANs, we failed 1 drive and set the rebuild rate to 30% on RAID 6, with a 40% scan rate. The SAN only had 6TB of data on it, and it rebuilt in 3 hours. We removed a 2nd disc, replaced it with a blank, and it was rebuilt in another 4 hours, with no real impact on performance tests.

Separating the underlying 6Gb/s SATA ports x 8 into effectively 16 x 3Gb/s channels using a SAS expander gives each disc 300MB/s of bandwidth. They can't physically perform better than 200~220MB/s, so there is enough bandwidth on the SATA and PCIe bus to cope. LSI rate the 9280-8i at 2500MB/s, but it does work faster with CacheCade - up to 3200MB/s in that test.

So a real SAN has many spindles of high performance, and the WD VRs are better than some cheap SSD drives.
RDMA is very fast, with low CPU cycles, and it reads and writes directly to RAM at 40Gb/s. We created a RAM drive and tested it up to 3200MB/s, which is as fast as the PCIe 2 bus / LSI could handle. If you can't afford this relatively cheap equipment (compared to an IBM DS or HP StorageWorks SAN) then don't whinge at ME that you can't get fast performance. Any real cloud would have what we have built. None of our customers, ALL 300 of them, have lost a single bit of data; all VMs are running very quickly, and with no problems now for 3 months.

So if you don't appreciate being shown how to build a kick-ASS SAN then PISS OFF!

On 24.10.2012 08:17, Jason Davis wrote:

How often do you have folks cache-missing and hitting the spinning rust below the CacheCade tier?

On Oct 23, 2012 6:25 PM, wrote:

Hi, with our SCST RDMA we have 40Gb/s QDR Gen 2 with Mellanox Grid Director 4036 switches + HP Gen2 QDR chassis switches.

Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller, depending on block size, in IOMeter tests: LSI 9280-8i with battery cache + CacheCade 2.0, x 2 in PCIe 16x slots on a Supermicro motherboard. 1 x 250GB SSD for CacheCade + 16 x 1TB WD VR drives (16 - hot standby - 2 for RAID 6 parity) = 2600MB/s.

Most of our tests are from 4K to 128K block size. At 4K we get 295,000 IOPS at 1590 MB/s, and at 128K, 350,000 IOPS at 2400~2600MB/s. We have tuned 64K and 128K block sizes on different LUNs: 64K for database and 128K for general file.

The best thing about Infiniband is low CPU cycles, only 5~6% during these tests. Latency is as low as 4-6ms average read time; concurrent response times are from 5-12ms, and even under heavy load it stays below 20ms. Infiniband latency itself is below 0.01 us, which is why we chose it.

We run MSSQL on the 64K-formatted LUNs; it's massively fast. If we copy the same data twice the CacheCade kicks in and we achieve even better speeds. I've compared this to block I/O over 8Gb/s Fibre Channel, and it barely gets 120,000 IOPS, at much higher latency and bandwidth. So for our money, RDMA wins!

Bruce

On 23.10.2012 21:20, Jason Davis wrote:

Out of curiosity, are there any quick performance numbers for these ZFS + GlusterFS mashups you guys are talking about? Specifically, IOPS and latency? Sequential read/write performance honestly isn't a very good benchmark of your SAN's performance; it's like comparing CPUs based solely on how many GHz they run at. Sure, you can get great MB/s or GB/s with SATA disk, but I'd reckon that IOPS performance is abysmal. If you are utilizing GlusterFS without the cache-pooling magic that is ZFS, then I would imagine that latency can be an issue.

On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote:

Hi Fabrice,

I know, OpenSolaris/Solaris and Oracle - it's quite a thing. I have been an open source user for more than 10 years, and I did not like this constellation with Oracle at the beginning. But Oracle does its work well, I know that. The cost is $700 per socket and you can use as many TB as you want, with the full premier support from Oracle. Nexenta develops on the Illumos code, and the licence is TB-based; that is not my favorite. As well, the pool version from Nexenta lags behind. Nexenta Infiniband is currently not a usable solution. But everyone can use what they want; everyone must decide for themselves.

SRP targets or iSER are not difficult to configure. Use SRP for the storage-unit connection. Solaris and GlusterFS build one storage unit; the GlusterFS server exports the final volume to the clients, whether KVM, VMware, Hyper-V, etc.
You can use native GlusterFS, RDMA, NFS or CIFS to export the volume. SRP has nothing to do with VMware.

When you use a 7200 rpm SAS drive the access time is the same as a SATA drive, only the quality of the hardware is better. When you need performance you must use 15000 rpm SAS drives. But that's not needed when you install SSDs for ZIL/L2ARC. ZeusRAM rocks :-)

I use dedup only on secondary storage or on a backup server, not on primary storage. When you use SSD cache and SATA drives you get cheap and fast storage; a 1TB drive costs under $100. Currently I have no need to save storage volume.

Which application uses atime? I know "find -atime N". atime on storage that holds only virtual disks? I don't need to know when I last opened the disk :-)

For a Solaris GUI take a look at napp-it: http://www.napp-it.org/

Greetings from Germany
Andreas

----- Original Message -----
From: "Fabrice Brazier"
To: cloudstack-users@incubator.apache.org
Sent: Tuesday, 23 October 2012 12:30:50
Subject: RE: Primary Storage

Hi Andreas,

Hmm, that's pretty cool. I know they still have trouble with Infiniband on Nexenta, but it's clearly a priority on their roadmap (plus I trust the Nexenta team to evolve ZFS more than Oracle).

I agree iSCSI over Infiniband increases the latency, but most of the time it's just simpler to use IPoIB than native IB; for example, if you use VMware, the support for Infiniband is provided by Mellanox and not VMware, so if you have an issue, VMware support probably won't help you anymore.

About the RAID, I'm a fan of RAID 10: I'd rather build a RAID 10 of 7200 rpm SAS drives than multiple raidz/raidz2 vdevs of 15k SAS drives, particularly for a virtual environment with a random workload and multiple access.

I'm a fan of NFS, so agreed about the ZIL, and ZeusRAM is the only choice for that (with a classical SSD you can hit the write hole). Agreed on compression too (but only lzjb; gzip levels use too much CPU). Disabling atime helps decrease the IOPS load, but I'm not sure it's really a best practice.

About deduplication, I don't totally agree. It really depends on your array, the workload and the VM types on your CloudStack. Actually, when I build a ZFS array I count 1GB of RAM per 1TB of disk; with deduplication I count 2GB of RAM per 1TB of disk (plus, in Nexenta 4 they will add the possibility to use SSD for the deduplication metadata, like the L2ARC for the read cache).

The last point is about your drives: why SATA drives? I mean, SATA doesn't respect its latency, and the reliability of a SAS drive is 10x that of SATA. Plus, now you can find many NL-SAS drives at low cost.

But it's really a nice architecture. I've never tried GlusterFS (for the moment); in that case it's really a good way to have a replacement for a metro-cluster for free. I've tried one of its competitors (OneFS), and clearly clustered filesystems are the future.

Cheers,
Fabrice

-----Original Message-----
From: Andreas Huser [mailto:ahuser@7five-edv.de]
Sent: Tuesday, 23 October 2012 11:37
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage

Hi Fabrice,

I don't know what other people do, but I have no problems with Infiniband + GlusterFS + CloudStack. I don't use Nexenta: it's based on Illumos and does not work well with Infiniband. I have two different clusters in production environments.

The first: Solaris 11 with built-in GlusterFS 3.3, exporting the Gluster volume with RDMA. Performance is okay; you can use that for smaller environments.

The second is a bit more complex, with a GlusterFS server in the middle:
ZFS server (based on Solaris 11):
1.) Create a zpool with at least two vdevs and SSD read/write cache.
2.) Create a thin-provisioned volume ("zfs create -V"), disable atime and enable compression (do not enable dedup!), and export it as an (iWarp) SRP target to the GlusterFS server, using a direct connection without an IB switch.

GlusterFS server:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6).
2.) Use the OFED driver from https://www.openfabrics.org
3.) Import the SRP target from the ZFS server and format it as xfs.
4.) Create a Gluster volume: "volume create xy transport rdma" (use only rdma).
5.) Connect the second IB port to an IB switch.

CloudStack hypervisor node:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6).
2.) Use the OFED driver from https://www.openfabrics.org
3.) Import the Gluster volume.

ZFS Thin Volume ---- Infiniband SRP ----> GlusterFS ---- GFSVol rdma ----> IB Switch ----> Clients

The ZFS and GlusterFS servers form one storage unit, connected directly with 40Gbit Infiniband point-to-point. You do not feel that there is a cable between them!

Important: when you have Infiniband, do not use IPoIB with iSCSI! If one already has Infiniband then one should also use its advantage: IPoIB has a higher latency than iWarp/SRP.

-- SRP --
local address:  LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000
remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115 VAddr 0x000000088e6000
------------------------------------------------------------------
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000         1.29         125.30       1.31
------------------------------------------------------------------

-- IPoIB --
[root@sv01sfogaa ~]# ping 10.200.0.10
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms

When you put load on IPoIB, the latency increases, and that is not good.

That is my recommendation for a simple GlusterFS mirror:
- Supermicro server with Intel hardware and expander backplane
- 1x Crucial M4 SSD read cache
- 2x ZeusIOPS SSD write cache (mirror)
- SATA 24/7 hard drives
- LSI HBA 9207 or 9211
- ConnectX-2 QDR dual-port Infiniband adapter (HP refurbished with full warranty for $100). Important > flash the newest firmware from Mellanox!
- Mellanox IB switch
- Solaris 11
- GlusterFS 3.3 compiled with ib_verbs
- Gluster volume transport only rdma

Throughput is constant at up to 200 Mbyte/s; you get more throughput with more storage servers or more hard drives on JBOD.

Info:
- I have had some problems with Infiniband RDMA or SRP on OpenIndiana, Illumos or Nexenta. Some adapters have a high latency or no stable connection. Use Solaris, that's the right way!
- OpenIndiana is beta! Infiniband ib_verbs does not work, or does not work well!
- Use Solaris 11: Infiniband ib_verbs is native and stable.
- Don't use Ubuntu client or server for Infiniband! Use RedHat, Fedora or CentOS and install the right drivers from https://www.openfabrics.org/downloads/OFED/
- You have no SSD cache? Disable sync on the ZFS volume! Important: you lose safety for your data, but some protocols use sync flags in transport. For example, NFS uses fsync by default: the write cache is not active and NFS writes data directly to the hard drive. For data safety and performance, give the storage server an SSD write cache. ZFS works by default with sync=standard, which prevents write holes (it's a COW system).
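To make those steps concrete, here is a minimal sketch of the ZFS and Gluster commands involved (the pool name, device names, hostname and sizes are made up for illustration; the zpool/zfs commands run on the Solaris storage node, the rest on the CentOS GlusterFS server):

    # ZFS node: mirrored pool with an SSD log device (ZIL) and an SSD cache
    # device (L2ARC).
    zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
          log c1t4d0 cache c1t5d0

    # Thin-provisioned (sparse) zvol with compression on and dedup off.
    zfs create -s -V 2T -o compression=lzjb -o dedup=off tank/gfsvol

    # GlusterFS server: format the LUN imported over SRP as xfs, mount it,
    # and create an RDMA-only Gluster volume on that brick.
    mkfs.xfs /dev/sdb
    mkdir -p /bricks/gfsvol
    mount /dev/sdb /bricks/gfsvol
    gluster volume create gfsvol transport rdma gfs01:/bricks/gfsvol
    gluster volume start gfsvol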
I hope that I could help a little.

Greetings from Germany
Andreas

----- Original Message -----
From: "Fabrice Brazier"
To: cloudstack-users@incubator.apache.org
Sent: Tuesday, 23 October 2012 09:55:15
Subject: RE: Primary Storage

Hi Andreas,

I just saw your configuration; it seems quite interesting. If I understand well, you want to build a ZFS array on the backend and export LUNs (probably iSCSI over Infiniband) to your Linux cluster, and on the Linux cluster you put GlusterFS. I can understand the point: with that you can have very good performance and reliability (ZFS), plus scalability and redundancy (Gluster), for very low cost.

So just one question: did you try the global namespace implementation from Nexenta? If yes, can you tell me which configuration worked best for you? I mean, the fact that you have a Gluster cluster in the middle must impact the overall performance, no?

Fabrice

-----Original Message-----
From: Andreas Huser [mailto:ahuser@7five-edv.de]
Sent: Tuesday, 23 October 2012 05:40
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage

Hi,

for CloudStack I use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That gives the best performance and the most scalable storage. I have tested several different solutions for primary storage, but most are too expensive and not economic for a CloudStack cluster, or have poor performance.

My configuration:

Storage node: Supermicro server (Intel hardware) with Solaris 11, with SSD write and read cache (read: Crucial M4, write: ZeusIOPS), GlusterFS, and a dual-port ConnectX 40Gbit/s Infiniband adapter.

I have installed GlusterFS directly on Solaris with modified code. If you want to build bigger systems for more than 50 VMs, it is better to split Solaris and GlusterFS, with a separate head node for GlusterFS.

That looks like:

Solaris ZFS backend storage with a dataset volume (thin provisioned) --> (SRP target attached directly, without an Infiniband switch, to the GF node) --> GlusterFS node (the SRP target formatted with an xfs filesystem, create a GlusterFS volume) --> (Infiniband over a Mellanox port switch) --> CloudStack node mounts the GlusterFS volume over RDMA.

For the dataset volume on the ZFS storage, disable atime and enable compression (space reclaim). With compression you can shrink the ZFS volume with a command: on Linux, dd from /dev/zero; in a Windows VM, with sdelete. That gives you space back on the primary storage for files deleted in a VM, or for vhd's or VMs deleted in CloudStack.

Greetings
Andreas

Kind regards

Andreas Huser
Managing Director
System Engineer / Consultant
(Cisco CSE, SMBAM, LCSE, ASAM)
---------------------------------------
Zellerstraße 28 - 77654 Offenburg
Tel: +49(781) 12786898
Mobil: +49(176) 10308549
ahuser@7five-edv.de

----- Original Message -----
From: "Outback Dingo"
To: cloudstack-users@incubator.apache.org
Sent: Tuesday, 23 October 2012 02:15:16
Subject: Re: Primary Storage

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:
>> Solaris 11 ZFS, and yes, we tried different setups, RAID levels, number
>> of SSD cache devices, ARC zfs options etc etc etc.
>>
>> Cheers

VMWare ??

--
Regards,
Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability
