From: bruce.m@v365.com.au
To: cloudstack-users@incubator.apache.org
Date: Wed, 24 Oct 2012 07:24:43 +0800
Subject: Re: Primary Storage

Hi,

With our SCST RDMA setup we have 40Gb/s QDR (Gen 2) InfiniBand, with Mellanox Grid Director 4036 switches plus HP Gen 2 QDR chassis switches. Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller, depending on block size, in Iometer tests: an LSI 9280-8i with battery-backed cache plus CacheCade 2.0 x2 in a PCIe x16 slot on a Supermicro motherboard, 1x 250GB SSD for CacheCade and 16x 1TB WD VelociRaptor drives (16 drives, minus hot standby and 2 for RAID 6 parity) = 2600MB/s.

Most of our tests are from 4K to 128K block size. At 4K we get 295,000 IOPS at 1590 MB/s, and at 128K we get 350,000 IOPS at 2400~2600MB/s. We have tuned 64K and 128K block sizes on different LUNs: 64K for database and 128K for general file use.

The best thing about InfiniBand is the low CPU load - only 5~6% during these tests. Latency is as low as 4-6ms average read time, concurrent response times are 5-12ms, and even under heavy load it stays below 20ms. InfiniBand fabric latency is below 0.01 us, which is why we chose it. We run MSSQL on the 64K-formatted LUNs and it is massively fast. If we copy the same data twice, CacheCade kicks in and we achieve even better speeds.
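For reference, a roughly comparable random-I/O run can be scripted with fio instead of Iometer. This is only a sketch - the device path, queue depth and job count below are assumptions, not the settings behind the numbers above:

  # 4K random read, direct I/O, 60 seconds against a raw test device
  fio --name=randread4k --filename=/dev/sdX --direct=1 --rw=randread \
      --bs=4k --ioengine=libaio --iodepth=32 --numjobs=8 \
      --time_based --runtime=60 --group_reporting

  # 128K sequential read to approximate the bandwidth figures
  fio --name=seqread128k --filename=/dev/sdX --direct=1 --rw=read \
      --bs=128k --ioengine=libaio --iodepth=32 --numjobs=4 \
      --time_based --runtime=60 --group_reporting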
I've compared this to block I/O over 8Gb/s Fibre Channel, and it barely gets 120,000 IOPS, at much higher latency and lower bandwidth. So for our money, RDMA wins!

Bruce

On 23.10.2012 21:20, Jason Davis wrote:
> Out of curiosity, are there any quick performance numbers for these ZFS +
> GlusterFS mashups you guys are talking about?
>
> Specifically, IOPS and latency? Sequential read/write performance honestly
> isn't a very good benchmark to determine your SAN's performance. It's like
> comparing CPUs based solely on how many GHz they run at. Sure you can get
> great MB/s or GB/s with SATA disk, but I'd reckon that IOPS performance is
> abysmal. If you are utilizing GlusterFS without the cache pooling magic
> that is ZFS, then I would imagine that latency can be an issue.
>
>
> On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote:
>
>> Hi Fabrice,
>>
>> I know, OpenSolaris/Solaris and Oracle - it's quite a thing.
>> I have been an open source user for more than 10 years, and at the
>> beginning I did not like this constellation with Oracle. But Oracle does
>> its job well, I know that. The cost is $700 per socket and you can use
>> as many TB as you want, with full Premier Support from Oracle.
>> Nexenta develops on the Illumos code, and the licence is TB-based.
>> That is not my favourite. Also, Nexenta's pool version lags behind, and
>> current Nexenta InfiniBand is not a usable solution.
>> But everyone can use what they like; everyone must decide for themselves.
>>
>> SRP targets or iSER are not difficult to configure. Use SRP for the
>> storage-unit connection: Solaris and GlusterFS together form one storage
>> unit, and the GlusterFS server exports the final volume to the clients
>> (KVM, VMware, Hyper-V etc.). You can use native GlusterFS, RDMA, NFS or
>> CIFS to export the volume. SRP has nothing to do with VMware.
>>
>> If you use a 7200 rpm SAS drive, the access time is the same as a SATA
>> drive; only the hardware quality is better. If you need performance you
>> must use 15,000 rpm SAS drives, but that is not needed once you install
>> SSDs for ZIL/L2ARC. ZeusRAM rocks :-)
>>
>> I use dedup only on secondary storage or on the backup server, not on
>> primary storage. With SSD cache in front of SATA drives you get cheap
>> and fast storage; a 1TB drive costs under $100, and currently I don't
>> need to save storage volume.
>>
>> Which application uses atime? I know "find -atime N". atime on storage
>> that holds only virtual disks? I don't need to know when I last opened
>> a disk :-)
>>
>> For a Solaris GUI take a look at napp-it
>> http://www.napp-it.org/
>>
>> Greetings from Germany
>> Andreas
>>
>>
>> ----- Original Message -----
>>
>> From: "Fabrice Brazier"
>> To: cloudstack-users@incubator.apache.org
>> Sent: Tuesday, 23 October 2012 12:30:50
>> Subject: RE: Primary Storage
>>
>> Hi Andreas,
>>
>> Hmm, that's pretty cool. I know they still have trouble with InfiniBand
>> on Nexenta, but it's clearly a priority on their roadmap (plus I trust
>> the Nexenta team more than Oracle to evolve ZFS).
>>
>> I agree that iSCSI over InfiniBand increases the latency, but most of
>> the time it's just simpler to use IPoIB than native IB. For example, if
>> you use VMware, InfiniBand support is provided by Mellanox and not by
>> VMware, so if you have an issue, VMware support probably won't help you.
>>
>> About the RAID, I'm a fan of RAID 10: I would rather build a RAID 10
>> from 7200 rpm SAS drives than multiple RAIDZ/RAIDZ2 vdevs of 15k SAS
>> drives.
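As an illustration of that mirrored-vdev ("RAID 10") layout on the ZFS side, a pool of striped mirrors with a separate log and cache device might be created roughly like this - the pool name and device names are placeholders, not taken from this thread:

  # three mirrored pairs striped together, plus a mirrored SLOG and an L2ARC cache device
  zpool create tank \
      mirror c0t0d0 c0t1d0 \
      mirror c0t2d0 c0t3d0 \
      mirror c0t4d0 c0t5d0 \
      log mirror c0t6d0 c0t7d0 \
      cache c0t8d0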
>> That is particularly true for virtual environments with a random
>> workload and many concurrent accesses.
>>
>> I'm a fan of NFS, so agreed about the ZIL, and ZeusRAM is the only real
>> option for that (with a classical SSD you can hit the write hole).
>>
>> Agreed on compression too (but only LZJB - gzip uses too much CPU).
>> Disabling atime does decrease the IOPS load, but I'm not sure it is
>> really a best practice. About deduplication I don't totally agree: it
>> really depends on your array, the workload and the VM types on your
>> CloudStack. When I build a ZFS array I count 1GB of RAM per 1TB of disk;
>> with deduplication I count 2GB of RAM per 1TB of disk (plus in Nexenta 4
>> they will add the possibility to use SSD for the deduplication metadata,
>> like the L2ARC for the read cache).
>>
>> The last point is about your drives: why SATA drives? I mean SATA
>> doesn't respect its latency, and the reliability of a SAS drive is 10x
>> that of SATA. Plus you can now find many NL-SAS drives at low cost.
>>
>> But it's really a nice architecture. I've never tried GlusterFS (for the
>> moment), and in that case it's really a good way to get a replacement
>> for a metro-cluster for free. I have tried one of its competitors
>> (OneFS), and clearly clustered filesystems are the future.
>>
>> Cheers,
>> Fabrice
>>
>> -----Original Message-----
>> From: Andreas Huser [mailto:ahuser@7five-edv.de]
>> Sent: Tuesday, 23 October 2012 11:37
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Primary Storage
>>
>> Hi Fabrice,
>>
>> I don't know what other people do, but I have no problems with
>> InfiniBand + GlusterFS + CloudStack. I don't use Nexenta: it is based on
>> Illumos and does not work well with InfiniBand.
>> I have two different clusters in production environments.
>>
>> The first: Solaris 11 with built-in GlusterFS 3.3, exporting the Gluster
>> volume with RDMA. Performance is okay; you can use that for smaller
>> environments.
>> The second: a little more complex, with a GlusterFS server in the middle.
>>
>> ZFS server (based on Solaris 11):
>> 1.) Create a zpool with at least two vdevs and SSD read/write cache.
>> 2.) Create a thin-provisioned volume ("zfs create -V"), disable atime,
>> enable compression (do not enable dedup!) and export it as an SRP target
>> to the GlusterFS server, using a direct connection without an IB switch.
>>
>> GlusterFS server:
>> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6).
>> 2.) Use the OFED driver from https://www.openfabrics.org
>> 3.) Import the SRP target from the ZFS server and format it as xfs.
>> 4.) Create a Gluster volume: "volume create xy transport rdma" (use only
>> rdma).
>> 5.) Connect the second IB port to an IB switch.
>>
>> CloudStack hypervisor node:
>> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6).
>> 2.) Use the OFED driver from https://www.openfabrics.org
>> 3.) Import the Gluster volume.
>>
>> ZFS thin volume ---- InfiniBand SRP ----> GlusterFS ---- GFS vol rdma ---->
>> IB switch ----> clients
>>
>> The ZFS and GlusterFS servers form one storage unit, connected directly
>> with 40Gbit InfiniBand point-to-point. You do not feel that there is a
>> cable in between!
>>
>> Important: if you have InfiniBand, do not use IPoIB with iSCSI! If you
>> already have InfiniBand, then you should also use its advantages:
>> IPoIB has a higher latency than SRP!
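Pieced together, the chain described in those steps might look roughly like the following. This is only a sketch: pool, volume, brick, mount-point and host names are placeholders, and the exact service names and mount options depend on the Solaris, OFED and GlusterFS versions in use.

  # --- Solaris 11 / ZFS node: pool, thin volume, COMSTAR SRP target ---
  zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 \
      cache c0t4d0 log c0t5d0                  # two vdevs + SSD read/write cache
  zfs create -s -V 4T -o compression=on tank/gfsvol   # sparse (thin) zvol
  svcadm enable -r stmf                        # COMSTAR framework
  svcadm enable -r ibsrp/target                # SRP target service
  stmfadm create-lu /dev/zvol/rdsk/tank/gfsvol # prints the LU GUID
  stmfadm add-view <LU-GUID>                   # expose the LU to the initiator

  # --- GlusterFS node (CentOS + OFED): import the SRP LUN, brick, rdma volume ---
  srp_daemon -e -o                             # log in to the SRP target once
  mkfs.xfs /dev/sdb                            # the imported LUN, e.g. /dev/sdb
  mkdir -p /bricks/brick1 && mount /dev/sdb /bricks/brick1
  gluster volume create gfsvol transport rdma gfs1:/bricks/brick1
  gluster volume start gfsvol

  # --- CloudStack hypervisor node: mount the volume over RDMA ---
  mount -t glusterfs -o transport=rdma gfs1:/gfsvol /mnt/primary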
>>
>> SRP latency (usec):
>>
>> -- SRP --
>> local address:  LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000
>> remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115 VAddr 0x000000088e6000
>> ------------------------------------------------------------------
>> #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
>> 2       1000         1.29         125.30       1.31
>> ------------------------------------------------------------------
>>
>> -- IPoIB --
>> [root@sv01sfogaa ~]# ping 10.200.0.10
>> PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
>> 64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
>> 64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
>>
>> When you put load on IPoIB, the latency increases; that is not good.
>>
>>
>> This is my recommendation for a simple GlusterFS mirror:
>>
>> - Supermicro server with Intel hardware and expander backplane
>> - 1x Crucial M4 SSD read cache
>> - 2x ZeusIOPS SSD write cache (mirrored)
>> - SATA 24/7 hard drives
>> - LSI HBA 9207 or 9211
>> - ConnectX-2 QDR dual-port InfiniBand adapter (HP refurbished with full
>>   warranty for $100). Important > flash the newest firmware from Mellanox!
>> - Mellanox IB switch
>> - Solaris 11
>> - GlusterFS 3.3 compiled with ib_verbs
>> - Gluster volume with transport rdma only
>>
>> Throughput is constant at up to 200 MByte/s; you get more throughput
>> with more storage servers or more hard drives in the JBOD.
>>
>> Info:
>>
>> - I have some problems with InfiniBand RDMA or SRP on OpenIndiana,
>>   Illumos or Nexenta. Some adapters have a high latency or no stable
>>   connection. Use Solaris - that's the right way!
>> - OpenIndiana is beta! InfiniBand ib_verbs does not work, or not well.
>> - Use Solaris 11: InfiniBand ib_verbs is native and stable.
>> - Don't use Ubuntu as client or server for InfiniBand! Use RedHat,
>>   Fedora or CentOS and install the right drivers from
>>   https://www.openfabrics.org/downloads/OFED/
>> - You have no SSD cache? Then disable sync on the ZFS volume! Important:
>>   you lose safety for your data, but some protocols use sync flags in
>>   transport. For example NFS uses fsync by default; the write cache is
>>   not active and NFS writes data directly to the hard drive. For data
>>   safety and performance, give the storage server an SSD write cache.
>>   ZFS works by default with sync=standard, which prevents write holes
>>   (COW system).
>>
>> I hope that I could help a little.
>>
>> Greetings from Germany
>> Andreas
>>
>>
>> ----- Original Message -----
>>
>> From: "Fabrice Brazier"
>> To: cloudstack-users@incubator.apache.org
>> Sent: Tuesday, 23 October 2012 09:55:15
>> Subject: RE: Primary Storage
>>
>> Hi Andreas,
>>
>> I just saw your configuration; it seems quite interesting.
>> If I understand correctly, you want to build some ZFS arrays on the
>> backend and export LUNs (probably via iSCSI over InfiniBand) to your
>> Linux cluster, and on the Linux cluster you put GlusterFS.
>> I can see the point: with that you can have very good performance and
>> reliability (ZFS), plus scalability and redundancy (Gluster), for very
>> low cost.
>> So just one question: did you try the global namespace implementation
>> from Nexenta? If yes, can you tell me which configuration is best for
>> you? I mean, the Gluster cluster in the middle must impact the overall
>> performance, no?
>>
>> Fabrice
>>
>> -----Original Message-----
>> From: Andreas Huser [mailto:ahuser@7five-edv.de]
>> Sent: Tuesday, 23 October 2012 05:40
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Primary Storage
>>
>> Hi,
>>
>> For CloudStack I use Solaris 11 ZFS + GlusterFS over InfiniBand (RDMA).
>> That gives the best performance and the most scalable storage.
>> I have tested several different solutions for primary storage, but most
>> are too expensive and not economical for a CloudStack cluster, or they
>> have poor performance.
>>
>> My configuration:
>> Storage node:
>> Supermicro server (Intel hardware) with Solaris 11, SSD write and read
>> cache (read: Crucial M4, write: ZeusIOPS), GlusterFS, and a dual-port
>> ConnectX 40Gbit/s InfiniBand adapter.
>>
>> I have installed GlusterFS directly on Solaris with modified code.
>> If you build bigger systems for more than 50 VMs, it is better to split
>> Solaris and GlusterFS, with a separate head node for GlusterFS.
>>
>> That looks like: Solaris ZFS backend storage with a dataset volume (thin
>> provisioned) --> (SRP target attached directly, without an InfiniBand
>> switch, to the GlusterFS node) --> GlusterFS node, where the SRP target
>> is formatted with an xfs filesystem and a GlusterFS volume is created
>> --> (InfiniBand over a Mellanox port switch) --> CloudStack node mounts
>> the GlusterFS volume over RDMA.
>>
>> For the dataset volume on the ZFS storage, disable atime and enable
>> compression. (Space reclaim:) With compression you can shrink the ZFS
>> volume with "dd if=/dev/zero" on Linux or with sdelete in a Windows VM.
>> That gives you space back on the primary storage for files deleted
>> inside a VM, or for vhd's or VMs deleted in CloudStack.
>>
>> Greetings
>> Andreas
>>
>>
>> Kind regards
>>
>> Andreas Huser
>> Managing Director
>> System Engineer / Consultant
>> (Cisco CSE, SMBAM, LCSE, ASAM)
>> ---------------------------------------
>> Zellerstraße 28 - 77654 Offenburg
>> Tel: +49(781) 12786898
>> Mobil: +49(176) 10308549
>> ahuser@7five-edv.de
>>
>>
>> ----- Original Message -----
>>
>> From: "Outback Dingo"
>> To: cloudstack-users@incubator.apache.org
>> Sent: Tuesday, 23 October 2012 02:15:16
>> Subject: Re: Primary Storage
>>
>> On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:
>>
>>> Solaris 11 ZFS, and yes, we tried different setups: RAID levels, number
>>> of SSD cache devices, ARC ZFS options, etc. etc.
>>>
>>> Cheers
>>
>> VMware ??
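To put the ZFS dataset settings and the space-reclaim trick from this thread into command form, a minimal sketch might look like the following. The pool/dataset names and file paths are placeholders; the sync=disabled line reflects the "no SSD ZIL" caveat above and should only be used if the risk of losing recent writes is acceptable.

  # on the ZFS storage node: dataset holding the VM disks
  zfs set atime=off tank/primary
  zfs set compression=on tank/primary
  # only without an SSD ZIL, and only if you accept losing recent writes on power failure:
  # zfs set sync=disabled tank/primary

  # inside a Linux VM: zero the free space so the thin/compressed backing volume can shrink
  dd if=/dev/zero of=/zerofile bs=1M; rm -f /zerofile

  # inside a Windows VM the equivalent is:
  #   sdelete -z c: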