cloudstack-users mailing list archives

From Jason Davis <scr...@gmail.com>
Subject Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
Date Wed, 24 Oct 2012 13:44:39 GMT
ZFS for Linux is still somewhat in its infancy; it's certainly not as
performant as ZFS running on FreeBSD or Solaris/Illumos.

If you are a CentOS shop, then using LSI's controller cards with CacheCade
would be the route I would go. Phoronix has some decent benchmarks with ZFS
on Linux if you are interested:

http://www.phoronix.com/scan.php?page=article&item=linux_zfs_june2012&num=1

The standard NFS daemon seems to do well in my experience; however, I must
say I haven't had much actual experience with a thousand-plus clients or
anything like that connecting to it.
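
For what it's worth, a minimal sketch of the usual first knobs on a
CentOS 6 style NFS server (the paths, thread count and export options
below are illustrative assumptions, not something tested in this thread):

  # /etc/sysconfig/nfs - raise the number of kernel nfsd threads (default 8)
  RPCNFSDCOUNT=64

  # /etc/exports - export the primary storage path to the hypervisor subnet
  /export/primary 10.0.0.0/24(rw,sync,no_root_squash)

  # apply the changes
  exportfs -ra
  service nfs restart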



On Wed, Oct 24, 2012 at 8:29 AM, Trevor Francis <
trevor.francis@tgrahamcapital.com> wrote:

> Seems my original post has taken a detour. Who knew storage could be so
> controversial!
>
> We are a CentOS house and it would be nice to have a standard OS across
> our platform, but many people have been singing ZFS's praises. To bridge
> that gap, ZFS can evidently be deployed on CentOS 6.3 using ZFS on Linux.
> Does anyone have experience with this? Would it be better to deploy
> CentOS with ZFS and a cache drive, or CentOS 6.3 ext4 with flashcache or
> bcache? Also, is the standard Linux NFS daemon sufficient to handle a
> significant load, or do you recommend a different NFS daemon... or going
> iSCSI? Our biggest concern is that most of our VMs do many small
> sequential writes (log files).
>
> Thoughts before I make a decision?
>
>
>
> Trevor Francis
> Partner
> 46 Labs | The PeerEdge Cloud
> http://www.46labs.com | http://www.peeredge.net
> 405-362-0046 - Voice | 405-410-4980 - Cell
> trevorgfrancis - Skype
> trevor@46labs.com
>
> Solutions Provider for the Telecom Industry
>
> <http://www.twitter.com/peeredge> <http://www.facebook.com/PeerEdge>
>
> On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fabrice.brazier@apalia.net>
> wrote:
>
> Hi,
>
> @Bruce, it's pretty funny: you bought 30 discs but only 2 LSI 9280-8i
> controllers, and "8i" means 8 internal ports (so with 2 LSI cards you
> can attach only 16 disks).
>
> Beyond that, a VelociRaptor is only a 10k disc, so count roughly 120
> IOPS per disk. If you put them in a RAID 5 (and everybody knows RAID 5
> doesn't scale well past about 12 disks, the performance doesn't increase
> much, but never mind, you built your RAID 5 of 30 disks), you get a
> maximum of (30-1)*120 = 3480 IOPS for reads and ((30-1)*120)/5 = 700
> IOPS for writes (in a 100% random profile). And you have only 32GB of
> RAM, which is peanuts for caching your read IOPS.
>
> So the only way you get performance is your CacheCade. Now, CacheCade is
> made of SSDs (funny, that's actually the same idea as the ZIL and L2ARC
> in ZFS xD), and the maximum size of a CacheCade volume is 512GB per
> controller. You have 2 controllers, so 1TB of CacheCade at most.
>
> So you say your storage doesn't cost anything, but you don't talk about
> the size and the price of your CacheCade. And another thing: the day you
> have a real workload (not only 50 VMs...) and that workload is larger
> than 1TB, come back and explain your very large performance numbers.
>
> Me too, I could build a ZFS server with 4 ZeusRAMs for the ZIL and 5
> disks behind them, run a benchmark on 2GB files and say "hey, see, I
> have 500,000 IOPS and 4GB/s of throughput"...
>
> Regards
>
> -----Original Message-----
> From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
> Sent: Wednesday, 24 October 2012 06:04
> To: cloudstack-users@incubator.apache.org
> Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
>
>
>
> These SANs we have built have saved us a fortune compared to FC block
> I/O SANs from IBM, HP etc.
>
> I've been a contractor for IBM and HP for 20 years. I know how much they
> cost $$$$$ and what they can do.
>
> For a decent DS4000 you'll be looking at $80K, or $90K for StorageWorks,
> and if you're lucky you may get 100,000, perhaps 150,000 IOPS.
>
> What I've built is better, cheaper, and frankly, as I support Linux and
> SCST, much more reliable.
>
> The WD VR 1TB we get new for $310, so 30 x = $9,300; 2 x LSI 9280-8i =
> $1,800 ($900 ea); the case $600; the mobo and 1 x Xeon 5650 $1,000; 32GB
> ECC RAM $800; dual PSU $650: about $14,000 for a much superior SAN.
>
> As I explained, another $15K for the InfiniBand network, which does SRP
> and IP.
>
> As for your issues with cache misses, the CacheCade doesn't miss much in
> our tests, with WriteBack enabled, BBU enabled, and the SANs on our
> primary 22KVA UPS. It never fails.
>
> Also, it's not doing very much unless a user in the cloud is hitting
> sequential blocks or reading/writing the same data repeatedly for some
> weird reason. Even then we see little lag or slowdown when, as you say,
> it's hitting the disc. InfiniBand with LSI just rocks even more than we
> had thought!
>
> The R/W pattern from groups of 50 VMs appears to be completely random:
> web servers, virtual firewalls, email servers, DNS, FTP, Windows
> domains, SharePoint, MSSQL, MySQL working on the same LUN, which are not
> considered critical (as stated by clients). Each VM is getting over
> 300-1200MB/s constantly, even when other users are pulling or pushing a
> file copy at 250Mb/s between VMs. We've actually had to figure out how
> to slow down the SRP connections. If a user is provisioned with 2 LUNs
> from each SAN, in Windows you can copy between VM drives at 1220MB/s,
> faster than SSD; in Linux it's even faster, and that is on ESXi 4.1 with
> 15-20 servers per VM host.
>
> We have SNMP monitoring the 2 primary SANs, and they never go above 30%
> bandwidth, so the point is that the cache has no real bearing on cloud
> I/O; cloud I/O is fairly low. We have LSI supplying us with the new
> PCIe 3 cards for a trial, so I expect even better performance. Long
> story short: get LSI and 16 x WD VRs and try it yourself! And yes, we
> compared IOPS against 5 x 500MB/s SSDs, and it's SLOWER! MORE SPINDLES =
> MORE SPEED!
>
> The main reason we wanted fast I/O is for backups and snapshots, so as
> not to interfere with the daily operational workings of the cloud VM
> guests, which it doesn't. SO many times with FC I've seen the attitude
> "yeah, that will work", and then some smart-ass does a restore or backup
> in production hours, only to see the whole cluster slow down to a crawl.
> On iSCSI it's even more pathetic.
>
> With SRP on LSI, "WE DON'T HAVE THAT PROBLEM!"
>
> I did a snapshot restore of 2 machines yesterday for a client; it took 2
> minutes to restore 100GB, and none of the other servers were affected.
> SNMP latency on all VMs showed no impact, and there were virtually 0 CPU
> cycles on the VMware hosts for the I/O move. I/O latency (storage path)
> for other VMs, from SNMP data on the VM server, was 2ms, which is
> normal; the machine doing the copy crept up to 10ms but no more.
>
> So my conclusion, ON REAL-WORLD OBSERVATIONS, IS... INFINIBAND SRP IS
> BETTER THAN ANYTHING!
>
> Cache misses: we don't have a problem there at all. We're more worried
> about data rot, which is why we're interested in GFS on ZFS LUNs. But
> ZFS is still slow compared to what we have...
>
> We are going to launch a new script which writes files full of 16-bit
> random I/O, at 100MB and 1TB, and creates a binary checksum for each
> file; we can then compare the checksums over time to check for data rot.
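>
> A minimal sketch of what such a check could look like (illustrative
> only; the paths, sizes and choice of checksum tool are placeholders):
>
>   # write a file of random data and record its checksum
>   dd if=/dev/urandom of=/mnt/san/rotcheck-100M.bin bs=1M count=100
>   sha256sum /mnt/san/rotcheck-100M.bin > /var/lib/rotcheck/100M.sha256
>
>   # later, re-read the file and compare against the stored checksum;
>   # any mismatch means the bits on disk changed underneath us
>   sha256sum -c /var/lib/rotcheck/100M.sha256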
>
> Just so you can appreciate the speed: we ran IOmeter from 3 VMs on 3
> separate VM hosts. The 1st gets max I/O, i.e. up to 400,000 IOPS pushing
> 2400MB/s. We started the 2nd while the 1st was running: same results
> across the 2 VMs. Then the 3rd concurrently: they all dropped to
> 2000MB/s and IOPS went down to 220,000 per host, so yes, there is a
> penalty. But the way InfiniBand works is that the switch balances the
> load across all servers. 4th server online, same test, and we get
> 150,000 IOPS on each server, all doing 1700-1800MB/s at 128K blocks,
> 50%/50% random/sequential. Changing the distribution to 100% random
> makes little difference. We confirmed all the drives were being
> accessed; all were. We even watched the camera out at the DC to monitor
> the array LEDs... all flickering like mad! We looked at the metrics of
> the MegaRAID once we got it to work on CentOS: no errors, no issues or
> problems, no dropped reads or writes.
>
> We set the file on IOmeter as high as possible, 900000000 units. We then
> fired up 2 more hosts and did a real-world file copy: no real impact,
> still getting 2200MB/s+ read/write easily. So we stopped IOmeter and did
> 6 concurrent file copies between the 2 SANs, and every host blew us away
> at 2000MB/s read and write across all 6 hosts!! FXXK, we were amazed.
>
> Aside from that, LSI have done a super job on the cache modules on their
> cards. I don't see how ZFS will ever approach this kind of performance.
>
> Anyway, take from what I've said what you like. We have an awesome SAN,
> and it's CHEAP and amazingly fast!
>
> We're in the process of developing a public cloud, so when the web site
> is finished we will publish all of this in a whitepaper there.
>
> On 24.10.2012 10:57, Jason Davis wrote:
>
> Passionate, are you? Take it easy, I am not knocking what you are doing.
> The term "spinning rust" is a catch-all for hard drives, nothing
> personal at all; in fact the setup you have is impressive.
>
> Impressive until you have a cache miss. Cache misses are definitely an
> issue even in "capital E" enterprise storage.
>
> Big bad XIV disk arrays or 3PAR are impressive, fast arrays (crazy
> expensive ;]) when you are hitting DRAM or SSD, but if you are running
> cache-unfriendly workloads then ultimately your array is only as fast as
> the spinning disk sitting at the lowest tier.
>
> I agree that open storage is the way to go; from a service provider's
> perspective it's hard being competitive with AWS or RackSpace if you are
> using an expensive traditional SAN.
>
> So anyway, my question wasn't a poke at what you are doing, just
> curiosity about how big of an IOPS/latency drop you see once a read or
> write has to actually hit the spinning disk. :)
>
> On Oct 23, 2012 8:10 PM, wrote:
>
>
>
>
> Oh well, if you're negative and make rude comments then that's your
> problem; I don't care about you. If you don't believe that RDMA can
> perform this fast then again, your problem.
>
> Ask anyone with a decent 12+ disc RAID array and RDMA and they will tell
> you it will kick you in YOUR ballz! Anyone in storage knows that to get
> IOPS and performance you need 12, preferably 16, spindles.
>
> The spinning rust, as you call it, are the newest 200MB/s WD
> VelociRaptors with 64MB cache, 10K RPM 2.5" discs, in a 32-bay storage
> chassis. We've used them in the past and they are as reliable as the SAS
> drives we use, but faster!
>
> RAID 10 IS BULLSHIT! Bloody slow, and 50% goes up in smoke for nothing.
> All you get is the mirror speed of 2 drives, which is barely 400MB/s
> tops! I spit on RAID 10.
>
> We also tried RAID 60 and it was very good as well, but there is no
> point using it as we replicate to a 2nd SAN.
>
> When we tested our new SANs, we failed 1 drive and set the rebuild rate
> to 30% on RAID 6, with a 40% scan rate. The SAN only had 6TB of data on
> it, and it rebuilt in 3 hours. We removed a 2nd disc and replaced it
> with a blank, and it was rebuilt in another 4 hours, with no real impact
> on performance tests.
>
> Separating the underlying 6Gb/s SATA ports x 8 into effectively 16 x
> 3Gb/s channels using a SAS expander gives each disc 300MB/s of
> bandwidth. They can't physically perform better than 200-220MB/s, so
> there is enough bandwidth on the SATA and PCIe bus to cope. LSI rate the
> 9280-8i at 2500MB/s, but it does work faster with CacheCade: up to
> 3200MB/s in that test.
>
> So a real SAN has many spindles of high performance, and the WD VRs are
> better than some cheap SSD drives.
>
> RDMA is very fast, with low CPU cycles, and reads and writes directly to
> RAM at 40Gb/s. We created a RAM drive and tested it up to 3200MB/s,
> which is as fast as the PCIe 2 bus / LSI could handle.
>
> If you can't afford this relatively cheap equipment (compared to an IBM
> DS or HP StorageWorks SAN) then don't whinge at ME. You can't get fast
> performance. Any real cloud would have what we have built. None of our
> customers, ALL 300 of them, have lost a single bit of data, all VMs are
> running very quickly, and with no problems now for 3 months.
>
> So if you don't appreciate being shown how to build a kick-ass SAN then
> PISS OFF!
>
> On 24.10.2012 08:17, Jason Davis wrote:
>
> How often do you have folks cache-missing and hitting the spinning rust
> below the CacheCade tier?
>
> On Oct 23, 2012 6:25 PM, wrote:
>
>
>
>
> Hi,
>
> With our SCST RDMA we have 40Gb/s QDR Gen 2 with Mellanox Grid Director
> 4036 switches + HP Gen2 QDR chassis switches.
>
> Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller,
> depending on block size, in the IOmeter test: LSI 9280-8i with
> battery-backed cache + CacheCade 2.0, two of them in PCIe x16 slots on a
> Supermicro motherboard.
>
> 1 x 250GB SSD for CacheCade + 16 x 1TB WD VR drives (16, minus hot
> standby, minus 2 for RAID 6 parity) = 2600MB/s.
>
> Most of our tests are from 4K to 128K block size: at 4K we get 295,000
> IOPS at 1590MB/s, and at 128K 350,000 IOPS at 2400-2600MB/s.
>
> We have tuned 64K and 128K block sizes on different LUNs: 64K for
> database and 128K for general file.
>
> The best thing about InfiniBand is low CPU cycles: only 5-6% during
> these tests.
>
> Latency is as low as 4-6ms average read time; concurrent response times
> are from 5-12ms, and even under heavy load it's below 20ms.
>
> InfiniBand latency is below 0.01 us, which is why we chose it.
>
> We run MSSQL on the 64K-formatted LUNs; it's massively fast. If we copy
> the same data twice, the CacheCade kicks in and we achieve even better
> speeds.
>
> I've compared this to block I/O over 8Gb/s Fibre Channel, and it barely
> gets 120,000 IOPS, at much higher latency and with less bandwidth.
>
> So for our money, RDMA wins!
>
> Bruce
>
>
> On 23.10.2012 21:20, Jason Davis wrote:
>
>
> Out of curiosity, are there any quick performance numbers for these
> ZFS + GlusterFS mashups you guys are talking about? Specifically, IOPS
> and latency? Sequential read/write performance honestly isn't a very
> good benchmark to determine your SAN's performance; it's like comparing
> CPUs based solely on how many GHz they run at. Sure, you can get great
> MB/s or GB/s with SATA disk, but I'd reckon that IOPS performance is
> abysmal. If you are utilizing GlusterFS without the cache-pooling magic
> that is ZFS, then I would imagine that latency can be an issue.
>
> On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote:
>
>
>
> Hi Fabrice,
>
> I know, OpenSolaris/Solaris and Oracle is quite a thing. I have been an
> open-source user for more than 10 years, and at the beginning I did not
> like this constellation with Oracle. But Oracle does its work well, I
> know that. The cost is $700 per socket and you can use as many TB as you
> want, and you can use the full Premier Support from Oracle. Nexenta
> develops against the Illumos code, and their licence is TB-based; that
> is not my favourite. The pool version from Nexenta also lags behind, and
> currently Nexenta InfiniBand is not a usable solution. But everyone can
> use what they want; everyone must decide for themselves.
>
> SRP targets or iSER are not difficult to configure. Use SRP for the
> storage-unit connection: Solaris and GlusterFS together build one
> storage unit. The GlusterFS server exports the final volume to the
> clients, i.e. KVM, VMware, Hyper-V etc. You can use native GlusterFS,
> RDMA, NFS or CIFS to export the volume. SRP has nothing to do with
> VMware.
>
> When you use a 7200 rpm SAS drive, the access time is the same as a SATA
> drive; only the quality of the hardware is better. When you need
> performance you must use SAS drives at 15,000 rpm, but that's not needed
> when you install SSDs for ZIL/L2ARC. ZeusRAM rocks :-)
>
> I use dedup only on secondary storage or on the backup server, not on
> primary storage.
>
> When you use SATA SSD drives you have cheap and fast storage. A 1TB
> drive costs under $100; currently I have no need to save storage volume.
>
> Which application uses atime? I know "find -atime N", but atime on a
> storage system that stores only virtual disks? I don't need to know when
> I last opened the disk :-)
>
> For a Solaris GUI, take a look at napp-it: http://www.napp-it.org/
>
> Greetings from Germany
> Andreas
>
>
>
>
>
>
> ----- Original Message -----
> From: "Fabrice Brazier"
> To: cloudstack-users@incubator.apache.org
> Sent: Tuesday, 23 October 2012 12:30:50
> Subject: RE: Primary Storage
>
>
>
> Hi Andreas,
>
> Hmm, that's pretty cool. I know they still have trouble with InfiniBand
> on Nexenta, but it's clearly a priority on their roadmap (plus I trust
> the Nexenta team to evolve ZFS more than Oracle).
>
> I agree iSCSI over InfiniBand increases the latency, but most of the
> time it's simply easier to use IPoIB than native IB. For example, if you
> use VMware, InfiniBand support is provided by Mellanox and not VMware,
> so if you have an issue, VMware support probably won't help you anymore.
>
> About the RAID: I'm a fan of RAID 10. I would rather build a RAID 10
> with 7200 rpm SAS drives than multiple raidz/raidz2 vdevs of 15k SAS
> drives, particularly for a virtual environment with a random workload
> and multiple concurrent accesses.
>
> I'm a fan of NFS, so agreed about the ZIL, and ZeusRAM is the only
> option for that (with a classical SSD you can hit the write hole).
>
> Agreed about compression too (but only lzjb; the gzip levels use too
> much CPU). Disabling atime reduces the IOPS load, but I'm not sure it's
> really a best practice. About deduplication I don't totally agree: it
> really depends on your array, the workload and the VM types on your
> CloudStack. When I build a ZFS array I count 1GB of RAM per 1TB of
> disk; with deduplication I count 2GB of RAM per 1TB of disk (plus in
> Nexenta 4 they will add the possibility to use SSD for the deduplication
> metadata, like the L2ARC for the read cache).
>
> The last point is about your drives: why SATA drives? I mean, SATA
> doesn't respect its latency, and the reliability of a SAS drive is 10x
> that of SATA. Plus you can now find many NL-SAS drives at low cost.
>
> But it's really a nice architecture. I've never tried GlusterFS (for the
> moment); plus, in that case, it's really a good way to get a replacement
> for a metro-cluster for free. I tried one of its competitors (OneFS) and
> clearly clustered filesystems are the future.
>
> Cheers,
> Fabrice
>
>
> -----Message
>
> d'origine-----
>
> De :
>
> Andreas
>
> Huser
>
> [mailto:ahuser@7five-edv.de]
>
> Envoyé :
>
> mardi 23 octobre 2012
>
>
> 11:37
>
> À :
>
> cloudstack-users@incubator.apache.org
>
> Objet :
>
> Re:
>
>
> Primary Storage
>
>
> Hi Fabrice,
>
> I don't know what other people do, but I have no problems with
> InfiniBand + GlusterFS + CloudStack. I don't use Nexenta: it's based on
> Illumos and does not work well with InfiniBand. I have two different
> clusters in production environments.
>
> The first: Solaris 11 with built-in GlusterFS 3.3, exporting the Gluster
> volume with RDMA. Performance is okay; you can use that for smaller
> environments.
>
> The second is a little more complex, with a GlusterFS server in the
> middle.
>
> ZFS server (based on Solaris 11):
> 1.) Create a zpool with at least two vdevs and SSD read/write cache.
> 2.) Create a thin-provisioned volume ("zfs create -V"), disable atime
>     and enable compression (do not enable dedup!), and export it as an
>     (iWARP) SRP target to the GlusterFS server over a direct connection
>     without an IB switch.
>
> GlusterFS server:
> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6).
> 2.) Use the OFED driver from https://www.openfabrics.org
> 3.) Import the SRP target from the ZFS server and format it as xfs.
> 4.) Create a Gluster volume, "volume create xy transport rdma" (use
>     only rdma).
> 5.) Connect the second IB port to an IB switch.
>
> CloudStack hypervisor node:
> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6).
> 2.) Use the OFED driver from https://www.openfabrics.org
> 3.) Import the Gluster volume.
>
> ZFS thin volume ---- InfiniBand SRP ----> GlusterFS ---- GFS vol rdma
> ----> IB switch ----> clients
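>
> A rough, untested sketch of what these steps might look like as
> commands; the pool, device, LU and volume names are placeholders, not
> the actual configuration described above:
>
>   # -- ZFS server (Solaris 11) --
>   zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 \
>     cache c0t4d0 log c0t5d0                  # two vdevs + SSD cache/log
>   zfs create -s -V 2T tank/gfsvol            # sparse (thin) volume
>   zfs set compression=on tank/gfsvol         # dedup stays off
>   # export the zvol over SRP via COMSTAR (service names may differ)
>   stmfadm create-lu /dev/zvol/rdsk/tank/gfsvol
>   stmfadm add-view <LU-name-from-previous-command>
>   svcadm enable -r ibsrp/target
>
>   # -- GlusterFS server (CentOS + OFED) --
>   mkfs.xfs /dev/disk/by-id/<imported-srp-lun>
>   mount /dev/disk/by-id/<imported-srp-lun> /bricks/gfsvol
>   gluster volume create gfsvol transport rdma gfs01:/bricks/gfsvol
>   gluster volume start gfsvol
>
>   # -- CloudStack hypervisor node --
>   mount -t glusterfs -o transport=rdma gfs01:/gfsvol /mnt/primary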
>
>
> The ZFS and GlusterFS servers form a storage unit, connected directly
> with 40Gbit InfiniBand point-to-point. You do not feel that there is a
> cable in between!
>
> Important: when you have InfiniBand, do not use IPoIB with iSCSI! If you
> already have InfiniBand, then you should also use its advantage: IPoIB
> has a higher latency than iWARP/SRP!
>
>
>
> SRP latency (usec):
>
> -- SRP --
> local address:  LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c
>                 VAddr 0x00000001dda000
> remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115
>                 VAddr 0x000000088e6000
> ------------------------------------------------------------------
> #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
> 2       1000         1.29         125.30       1.31
> ------------------------------------------------------------------
>
> -- IPoIB --
> [root@sv01sfogaa ~]# ping 10.200.0.10
> PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
> 64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
> 64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
>
> When you put load on IPoIB, the latency increases, and that is not good.
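>
> (Output in this layout is what the OFED perftest latency tools print;
> something like the following would produce a comparable measurement.
> The exact tool and the target address are assumptions on my part, not
> taken from this thread:
>
>   ib_read_lat              # on the target host
>   ib_read_lat <target-ip>  # on the initiator
> )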
>
>
>
>
> That is my recommendation for a simple GlusterFS mirror:
>
> - Supermicro server with Intel hardware and an expander backplane
> - 1x Crucial M4 SSD read cache
> - 2x ZeusIOPS SSD write cache (mirrored)
> - SATA 24/7 hard drives
> - LSI HBA 9207 or 9211
> - ConnectX-2 QDR dual-port InfiniBand adapter (HP refurbished with full
>   warranty for $100). Important: flash the newest firmware from
>   Mellanox!
> - Mellanox IB switch
> - Solaris 11
> - GlusterFS 3.3 compiled with ib_verbs
> - Gluster volume with transport rdma only
>
> Throughput is constant at up to 200 MByte/s, with more throughput as you
> add more storage servers or more hard drives in the JBOD.
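>
> For what it's worth, a mirrored ("replica 2") Gluster volume over rdma
> would look roughly like this; the host names and brick paths are made-up
> placeholders, not part of this recommendation:
>
>   gluster volume create gmirror replica 2 transport rdma \
>       gfs01:/bricks/b1 gfs02:/bricks/b1
>   gluster volume start gmirror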
>
>
>
>
>
> Info:
>
> - I have had some problems with InfiniBand RDMA or SRP on OpenIndiana,
>   Illumos or Nexenta: some adapters have high latency or an unstable
>   connection. Use Solaris, that's the right way!
> - OpenIndiana is beta! InfiniBand ib_verbs does not work, or not well.
> - Use Solaris 11: InfiniBand ib_verbs is native and stable.
> - Don't use Ubuntu client or server for InfiniBand! Use RedHat, Fedora
>   or CentOS and install the right drivers from
>   https://www.openfabrics.org/downloads/OFED/
>
>
> - You have no SSD cache? Then disable sync on the ZFS volume.
>   Important: you lose data security, but some protocols use sync flags
>   in transport. For example, NFS uses fsync by default, so the write
>   cache is not active and NFS writes data straight to the hard drive.
>   For data security and performance, give the storage server an SSD
>   write cache. ZFS works by default with sync=standard, which prevents
>   write holes (COW system).
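>
> As a concrete illustration of that trade-off (the dataset name is a
> placeholder), the switch in question is:
>
>   zfs set sync=disabled tank/gfsvol   # faster, but unsafe on power loss
>   zfs set sync=standard tank/gfsvol   # default, honours synchronous writes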
>
>
>
> I hope that I could help a little.
>
> Greetings from Germany
> Andreas
>
>
>
> ----- Original Message -----
> From: "Fabrice Brazier"
> To: cloudstack-users@incubator.apache.org
> Sent: Tuesday, 23 October 2012 09:55:15
> Subject: RE: Primary Storage
>
>
>
> Hi Andreas,
>
> I just saw your configuration; it seems quite interesting. If I
> understand correctly, you want to build ZFS arrays on the backend,
> export LUNs (probably via iSCSI over InfiniBand) to your Linux cluster,
> and on the Linux cluster you put GlusterFS. I can see the point: with
> that you can have very good performance and reliability (ZFS), plus
> scalability and redundancy (Gluster), for very low cost.
>
> So just one question: did you try the global-namespace implementation
> from Nexenta? If so, can you tell me what configuration works best for
> you? I mean, the fact that you have a Gluster cluster in the middle must
> impact the overall performance, no?
>
> Fabrice
>
>
> -----Original Message-----
> From: Andreas Huser [mailto:ahuser@7five-edv.de]
> Sent: Tuesday, 23 October 2012 05:40
> To: cloudstack-users@incubator.apache.org
> Subject: Re: Primary Storage
>
>
> Hi,
>
> For CloudStack I use Solaris 11 ZFS + GlusterFS over InfiniBand (RDMA).
> That gives the best performance and the most scalable storage. I have
> tested several different solutions for primary storage, but most are too
> expensive, not economical for a CloudStack cluster, or have poor
> performance.
>
> My configuration:
>
> Storage node: Supermicro server (Intel hardware) with Solaris 11, with
> SSD write and read cache (read: Crucial M4, write: ZeusIOPS), GlusterFS,
> and a dual-port ConnectX 40Gbit/s InfiniBand adapter.
>
> I have installed GlusterFS directly on Solaris with modified code. If
> you build bigger systems for more than 50 VMs, it is better to split
> Solaris and GlusterFS, with a separate head node for GlusterFS.
>
> That looks like:
>
> Solaris ZFS backend storage with a dataset volume (thin provisioned)
> --> (SRP target attached directly, without an InfiniBand switch, to the
> GlusterFS node) --> GlusterFS node: the SRP target formatted with an xfs
> filesystem, create a GlusterFS volume --> (InfiniBand over a Mellanox
> port switch) --> CloudStack node mounts the GlusterFS volume over RDMA.
>
>
>
> For the dataset volume on the ZFS storage, disable atime and enable
> compression (space reclaim). With compression you can shrink the ZFS
> volume with dd from /dev/zero on Linux, or with sdelete in a Windows VM.
> That gives you back space on primary storage for files deleted inside a
> VM, or for vhd's or VMs deleted in CloudStack.
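>
> For illustration, the zero-fill trick would look roughly like this
> (untested sketch; the file path and drive letter are placeholders):
>
>   # inside a Linux VM: fill free space with zeros, then delete the file,
>   # so the compressed zvol underneath can reclaim the blocks
>   dd if=/dev/zero of=/zerofile bs=1M; rm -f /zerofile; sync
>
>   # inside a Windows VM: zero free space with Sysinternals sdelete
>   sdelete.exe -z C: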
>
>
> Greetings, Andreas
>
>
> Kind regards,
>
> Andreas Huser
> Managing Director / System Engineer / Consultant
> (Cisco CSE, SMBAM, LCSE, ASAM)
> ---------------------------------------
> Zellerstraße 28 - 77654 Offenburg
> Tel: +49(781) 12786898
> Mobil: +49(176) 10308549
> ahuser@7five-edv.de
>
>
>
>
>
>
> ----- Original Message -----
> From: "Outback Dingo"
> To: cloudstack-users@incubator.apache.org
> Sent: Tuesday, 23 October 2012 02:15:16
> Subject: Re: Primary Storage
>
>
>
> On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:
>
> >> Solaris 11 ZFS, and yes, we tried different setups: RAID levels,
> >> number of SSD cache devices, ARC ZFS options, etc. etc. etc.
> >>
> >> Cheers
>
> VMware??
