cloudstack-users mailing list archives

From Clayton Weise <cwe...@iswest.net>
Subject RE: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
Date Wed, 24 Oct 2012 18:20:10 GMT
You're going to see a dip in performance with KVM, primarily because of the qcow2 format (KVM people, please correct me if I'm wrong, because I'm new to KVM). Due to time constraints I wasn't able to run any tests with writeback cache turned on or off, but I've seen people in forums claim both that enabling it improves performance and that disabling it does, so who knows.
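For anyone who wants to poke at this themselves, here is a minimal sketch of the two knobs involved on a plain libvirt/KVM host: the image format and the per-disk cache mode. The paths and the guest name are placeholders, and cache='writeback' is just one of the possible values.

  # Image format: raw vs qcow2 (example paths)
  qemu-img create -f raw   /var/lib/libvirt/images/bench-raw.img   20G
  qemu-img create -f qcow2 /var/lib/libvirt/images/bench-qcow2.img 20G

  # The cache mode is set per disk in the guest's libvirt XML, e.g.:
  #   <driver name='qemu' type='qcow2' cache='writeback'/>
  # (other values include none and writethrough)
  virsh edit bench-vm    # edit the XML, then power-cycle the guest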

What I can tell you is that in my tests with our current setup I was able to achieve about 4-5k IOPS for random reads/writes, using a number of different benchmark utilities in both Linux and Windows, with KVM's raw disk format.  When I switched over to qcow2 it went down to about 2-2.5k.  With the way our test RAID system is constructed, our theoretical max is about 5k IOPS, so I'm not sure whether the drop is proportional to the overall potential (e.g. 50% of the total) or simply a hard cap at that number; I haven't had a chance to really kick things around much.  This test was done using an iSCSI LUN mounted from a Compellent SAN over 10Gbit Ethernet, with OCFS2 on top of it.  I know there was a comment earlier saying that OCFS2 was old, but what I can say about it is that it was extremely easy to configure and get up and running, and it seemed pretty solid both when we tested it about a year ago and again the other week.
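For anyone wanting to reproduce that kind of number, a random read/write test along these lines can be run with fio; the target path, file size, mix and queue depth below are illustrative, not the exact parameters used here.

  fio --name=randrw --filename=/mnt/primary/bench.dat --size=8G \
      --rw=randrw --rwmixread=70 --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=32 --numjobs=4 \
      --time_based --runtime=120 --group_reporting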

Regarding your ZFS on Linux question: ZFS on Linux just isn't there yet.  It's too new, and the work being done on it right now is primarily around stability and integrating features rather than performance, so if you're going to do ZFS, do it on either BSD or Solaris/Illumos.  The alternative is the hardware equivalent, something like LSI's CacheCade, which does basically the same thing and works just fine with Linux.
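For comparison, the ZFS-side equivalent of that SSD caching is an L2ARC read cache plus a mirrored log (ZIL) device added to the pool; a minimal sketch, with the pool and device names as placeholders:

  # SSD read cache (L2ARC) and mirrored intent log (ZIL)
  zpool add tank cache /dev/disk/by-id/ssd-cache0
  zpool add tank log mirror /dev/disk/by-id/ssd-log0 /dev/disk/by-id/ssd-log1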

From: Trevor Francis [mailto:trevor.francis@tgrahamcapital.com]
Sent: Wednesday, October 24, 2012 8:27 AM
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

So, since ZFS is super slow on CentOS, would it be better to use ext4 and bcache over the native Linux NFS daemon?

Also, I am reading that KVM is much more disk-performant with writeback caching. Does CloudStack deploy KVM VM instances with this enabled by default?



Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com

Solutions Provider for the Telecom Industry


On Oct 24, 2012, at 8:50 AM, Nik Martin <nik.martin@nfinausa.com> wrote:


On 10/24/2012 08:29 AM, Trevor Francis wrote:

Seems my original post has taken a detour. Who knew storage could be so
controversial!

We are a CentOS house and it would be nice to have a standard OS across our
platform, but many people have been singing ZFS' praises. To bridge that gap,
evidently ZFS can be deployed on CentOS 6.3 using ZFS on Linux.
Does anyone have experience with this? Would it be better to deploy
CentOS with ZFS and a cache drive, or CentOS 6.3 ext4 with flashcache or
bcache?


I inquired about ZFS on Linux in the #cloudstack IRC channel, and the quote I got was "ZFS is slow on Linux". I then asked "Slow relative to what: ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response was "yes", so I'm led to believe that it does not perform as well as other filesystems.  For large file storage on Linux, which Xen VHDs are, I tend to use XFS if the network protocol is NFS.
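As a rough illustration of that XFS-over-NFS layout (the device, mount point and export options here are placeholders, not Nik's actual settings):

  mkfs.xfs /dev/vg_primary/lv_primary
  mkdir -p /export/primary
  mount /dev/vg_primary/lv_primary /export/primary
  echo '/export/primary 10.0.0.0/24(rw,async,no_root_squash)' >> /etc/exports
  exportfs -ra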


Also, is the standard linux NFS daemon sufficient to handle a
significant load or do you recommend a different NFS daemon....or go iSCSI ?
Our biggest concern is that we have most VMs doing many sequential small
writes (log files).

When I first started with CS on XenServer in January, I thought thin provisioning was important, so I built filesystem-based storage using NFS, with XFS as the underlying filesystem, on LVM.  NFS (v3) on CentOS 6 became processor-bound VERY quickly, running 300% CPU load on a quad-core Xeon.

I switched to iSCSI with LVM as the backing store, and have had zero issues, except that I'm currently investigating lower-than-desired random read/write at < 4k random.  Using tests like IOzone in a VM, and running iostat -t 1 on the hypervisor console itself, I can get sustained 2500 IOPS on iSCSI, and it drops to 300 with < 4k blocks of data. I expect to resolve this soon though.
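A sketch of that kind of measurement, run from inside a guest and watched from the hypervisor (the file size, record size and path are examples, not the exact test Nik ran):

  # In the VM: IOzone write/rewrite plus random read/write at 4k records, direct I/O
  iozone -i 0 -i 2 -r 4k -s 8g -I -f /root/iozone.tmp

  # On the hypervisor: per-second device statistics while the test runs
  iostat -t 1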



Thoughts before I make a decision?



Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
Solutions Provider for the Telecom Industry


On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fabrice.brazier@apalia.net> wrote:


Hi,

@bruce, it's pretty funny, because you bought 30 discs but only 2 LSI 9280-8i; the 8i means 8 internal ports (so with 2 LSI controllers you can attach only 16 disks).

After that, a VelociRaptor is only a 10k disc, so we can count on about 120 IOPS per disk. If we put them in a RAID 5 (by the way, everybody knows that RAID 5 doesn't scale very well: beyond 12 disks the performance doesn't increase much, but never mind, you built your RAID 5 of 30 disks), you have a maximum of (30-1)*120 = 3480 IOPS for reads and ((30-1)*120)/5 ≈ 700 IOPS for writes (in a 100% random profile). And you have only 32GB of RAM, which is peanuts for caching your read IOPS.
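The same back-of-the-envelope estimate spelled out, where 120 IOPS per disk and a RAID 5 write penalty of 5 are the assumptions used above:

  echo "read IOPS:  $(( (30 - 1) * 120 ))"        # 3480
  echo "write IOPS: $(( (30 - 1) * 120 / 5 ))"    # 696, i.e. roughly 700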

So the only way you have to get performance is your CacheCade. Now, CacheCade is made of SSD (funny, actually, it's the same idea as the ZIL and L2ARC in ZFS xD), and the maximum size of a CacheCade volume is 512GB per controller. You have 2 controllers, so 1TB of CacheCade max.

So you say your storage doesn't cost anything, but you don't talk about the size and the price of your CacheCade. And another thing: the day you have a real workload (not only 50 VMs...) and that workload is larger than 1TB, come back and explain your very large performance to us.

Me too, I can build a ZFS server with 4 ZeusRAMs for the ZIL and 5 disks behind them, run a benchmark on 2GB files, and say "hey, see, I have 500,000 IOPS and 4GB/s of throughput"...

Regards

-----Original Message-----
From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
Sent: Wednesday, October 24, 2012 06:04
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT



These SANs we have built have saved us a fortune compared to FC block-I/O SANs from IBM, HP etc.

I've been a contractor for IBM and HP for 20 years. I know how much they cost $$$$$ and what they can do.

For a decent DS4000 you'll be looking at $80K, or $90K for StorageWorks, and if you're lucky you may get 100,000, perhaps 150,000 IOPS.

What I've built is better, cheaper, and, frankly, as I support Linux and SCST, much more reliable.

The WD VR 1TB we get new for $310, so 30 x = $9,300; LSI 9280-8i x 2, $1,800 ($900 ea); the case $600; the mobo and 1 x Xeon 5650 $1,000; 32GB ECC RAM $800; dual PSU $650 - about $14,000 for a much superior SAN.

As I explained, another $15K for the Infiniband network, which does SRP and IP.

As for your issues with cache misses, the CacheCade doesn't miss much in our tests, with WriteBack enabled, BBU enabled, and the SANs on our primary 22KVA UPS. It never fails.

Also, it's not doing very much unless a user in the cloud is hitting sequential blocks or reading/writing the same data repeatedly for some weird reason. Even then we see little lag or slowdown when, as you say, it's hitting the disc. Infiniband with LSI just rocks even more than we had thought!

The R/W pattern from groups of 50 VMs appears to be completely random: web servers, virtual firewalls, email servers, DNS, FTP, Windows domains, SharePoint, MSSQL, MySQL working on the same LUN, which are not considered critical (as stated by clients), with each VM getting over 300-1200MB/s constantly, even when other users are pulling or pushing a file copy @ 250Mb/s between VMs etc. We've actually had to figure out how to slow down the SRP connections. If a user is provisioned with 2 LUNs from each SAN, in Windows you can copy between VM drives @ 1220MB/s!!!!! faster than SSD; in Linux it's even faster, and that is in ESXi 4.1 with 15~20 servers per VM host.

We have SNMP monitoring the 2 primary SANs, and they never go above 30% bandwidth, so the point is the cache has no real bearing on cloud I/O; cloud I/O is fairly low. We have LSI supplying us with the new PCIe 3 cards for a trial, so I expect even better performance. So long STORY SHORT: GET LSI AND 16 x WD VRs and try it yourself! And yes, we compared IOPS against 5 x 500MB/s SSDs - it's SLOWER! MORE SPINDLES = MORE SPEED!

The main reason we wanted fast I/O is for backups and snapshots, so as not to interfere with the daily operational workings of the cloud VM guests. Which it doesn't. SOOO many TIMES with FC I've seen the attitude "yeah, that will work", and some smart-ass does a restore or backup in production hours, only to see the whole cluster slow down to a crawl. On iSCSI it's even more pathetic.

With SRP on LSI "WE DON'T HAVE THAT PROBLEM!"

I did a snapshot restore for 2 machines yesterday for a client; it took 2 mins to restore 100GB. And none of the other servers were affected. SNMP latency on all VMs: no impact. Virtually 0 CPU cycles on the VMware hosts for the I/O move. I/O latency on the storage path for the other VMs, from SNMP data on the VM server, was 2ms, which is normal; the machine doing the copy crept up to 10ms but no more.

So my conclusion, ON REAL WORLD OBSERVATIONS, IS.... INFINIBAND SRP IS BETTER THAN ANYTHING!

Cache misses: we don't have a problem there at all. We're more worried about data rot, which is why we're interested in GFS on ZFS LUNs. But ZFS is still slow compared to what we have.....

We are going to launch a new script which writes a file full of 16-bit random I/O, at 100MB and at 1TB; it will create a binary checksum for each file, and we can compare it over time to check for data rot.
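A minimal sketch of that kind of check (the file size, path and the choice of SHA-256 are illustrative assumptions, not the actual script described above):

  # Write a file of random data and record its checksum...
  dd if=/dev/urandom of=/mnt/san/rotcheck-100M.bin bs=1M count=100
  sha256sum /mnt/san/rotcheck-100M.bin > /mnt/san/rotcheck-100M.sha256

  # ...then re-verify it later; any mismatch points to silent corruption.
  sha256sum -c /mnt/san/rotcheck-100M.sha256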

Just so you can appreciate the speed: we ran IOmeter from 3 VMs on 3 separate VM hosts. The 1st gets max I/O, i.e. up to 400,000 IOPS, pushing 2400MB/s. We started the 2nd while the 1st was running: same results across the 2 VMs. Then the 3rd concurrently: they all dropped to 2000MB/s and IOPS went down to 220,000 per host. So yes, there is a penalty. But the way Infiniband works is that the switch balances the load across all servers. 4th server online, same test, and we get 150,000 IOPS on each server, all doing 1700~1800MB/s @ 128K blocks, 50%/50% random/sequential. Changing the distribution to 100% random makes little difference. We confirmed all the drives were being accessed; all were. We even took a look at the camera in our DC to monitor the array LEDs... all flickering like mad! We took a look at the metrics of the MegaRAID, as we got it to work on CentOS: no errors, no issues or problems, no dropped reads or writes.

We set the file on IOmeter as high as possible, 900000000 units. We then fired up 2 more hosts and did a real-world file copy: no real impact, still getting 2200MB/s+ read/write easily. So we stopped IOmeter and did 6 concurrent file copies between the 2 SANs, and each host blew us away at 2000MB/s read and write across all 6 hosts!! FXXK, we were amazed.

Aside from that, LSI have done a super job on the cache modules on their cards. I don't see how ZFS will ever approach this kind of performance.

Anyway, take from what I've said what you like. We have an awesome SAN... and it's CHEAP and amazingly fast!

We're in the process of developing a public cloud, so when the web site is finished we will publish all of this in a whitepaper there.

On 24.10.2012 10:57, Jason Davis wrote:

Passionate, are you? Take it easy, I am not knocking what you are doing.

The term "spinning rust" is a catch-all for hard drives, nothing personal at all; in fact the setup you have is impressive.

Impressive until you have a cache miss. Cache misses are definitely an issue even in "capital E" Enterprise storage.

Big bad XIV disk arrays or 3PAR are impressive, fast arrays (crazy expensive ;]) when you are hitting DRAM or SSD, but if you are running cache-unfriendly workloads then ultimately your array is only as fast as the spinning disk sitting at the lowest tier.

I agree that open storage is the way to go; from a service provider's perspective it's hard being competitive with AWS or RackSpace if you are using expensive traditional SAN.

So anyway, my question wasn't a poke at what you are doing, just curious about how big of an IOPS/latency drop you see once a read or write has to actually hit the spinning disk. :)

On Oct 23, 2012 8:10 PM, wrote:




oh well if your negative, and make rude comments then that's
your

problem, i don't care about you.

If you don't believe that
RDMA can

perform this fast then again, your problem.

Ask anyone
with a decent

12+ Disc Raid Array, and RDMA and they will tell you it
will kick you in

YOUR ballz! Anyone in storage knows to get iops and
performance you need

12 preferably 16 spindles.

the spinning
rust as you call it, are the

newest 200MB/s WD Veloci Raprors with
64MB Cache. @ 10KRPM 2.5" dics in

32 Bay Storage Chassis.

we've
used them in the past and they are as

reliable as the SAS drives we
use, but faster!


RAID 10 IS BULLSHIT!
Bloody slow and 50% goes
up in smoke for nothing. All you get is the

mirror speed of 2 drives
which is barley 400MB/s tops! I Spit on RAID 10



We also tried
Raid 60 and it was very good as well. But no point to

use it as we
replicated to 2nd SAN


When we tested our new SANs, we
failed 1
drive and set the rebuild rate to 30% on raid 6, with 40% scan

rate.
The SAN only had 6TB of data on it. and it rebuilt in 3 hours. we


removed a 2nd disc, and replaced with blank, and it was rebuild in


another 4 hours. With no real impact to performance tests.


Separating

the underlying 6Gb/s Sata ports x 8 into effectively 16 x
3Gb/s Channels

using a SAS Expander gives each Disc 300MB/s of
bandwidth. They cant

physically perform better than 200~220MB/s so
there is enough bandwidth

on the SATA and PCIE Bus to cope. LSI rate
the 9280-8i @ 2500MB/s but it

does work faster with Cache Cade. up to
3200MB/s in that test.


So a
real SAN has many spindles of high
performance, and the WD VRs are

better than some cheap SSD
drives.


RDMA is very fast, low cpu cycles,
and reads and write
directly to RAM@ 40Gb/s, we created a RAM drive and

tested it up to
3200MB/s which is as fast as the PCIE 2 Bus / LSI could

Handle.

If you cant afford this realtivley cheap equipment (compated to

IBM
DS or HP StorageWorks SAN) then dont wing at ME. You cant get fast


performance. Any real Cloud would have what we have built. None of our

customers, ALL 300 Of them! have lost a single bit of data, all
VMs are

running very quickly, and with no problems now for 3
months.


So If you
dont appreciate being shown how to build a
kick ASS SAN then PISS

OFF!

On 24.10.2012 08:17, Jason Davis wrote:

How often do you have folks cache-missing and hitting the spinning rust below the CacheCade tier?

On Oct 23, 2012 6:25 PM, wrote:





Hi

With our SCST RDMA we have 40Gb/s QDR Gen 2 with Mellanox Grid Director 4036 switches + HP Gen2 QDR chassis switches.

Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller, depending on block size, in the IOmeter test: LSI 9280-8i with battery cache + CacheCade 2.0, x 2 in PCIe 16x slots on a Supermicro motherboard.

1 x 250GB SSD for CacheCade + 16 x 1TB WD VR drives (16 - hot standby - 2 for RAID 6 parity) = 2600MB/s

Most of our tests are from 4K to 128K block size; @ 4K we get 295,000 IOPS @ 1590MB/s, and @ 128K, 350,000 IOPS @ 2400~2600MB/s.

We have tuned 64K and 128K block sizes on different LUNs: 64K for database and 128K for general file.

The best thing about Infiniband is low CPU cycles, only 5~6% during these tests.

Latency is as low as 4-6ms average read time. Concurrent response times are from 5-12ms; even under heavy load it's below 20ms.

Infiniband latency is below 0.01 us, which is why we chose it.

We run MSSQL on the 64K-formatted LUNs; it's massively fast. If we copy the same data twice, the CacheCade kicks in and we achieve even better speeds.

I've compared this to block I/O @ 8Gb/s Fibre Channel, and it barely gets 120,000 IOPS, at much higher latency.

So for our money, RDMA wins!

Bruce


On 23.10.2012 21:20, Jason Davis wrote:

Out of curiosity, are there any quick performance numbers for these ZFS + GlusterFS mashups you guys are talking about?

Specifically, IOPS and latency? Sequential read/write performance honestly isn't a very good benchmark to determine your SAN's performance. It's like comparing CPUs based solely on how many GHz they run at. Sure, you can get great MB or GB/s with SATA disk, but I'd reckon that IOPS performance is abysmal. If you are utilizing GlusterFS without the cache pooling magic that is ZFS, then I would imagine that latency can be an issue.






On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote:

Hi Fabrice,

I know, OpenSolaris/Solaris under Oracle is quite a thing. I have been an open source user for more than 10 years, and I did not like this constellation with Oracle at the beginning. But Oracle does its work well, I know that. The cost is $700 per socket and you can use as many TB as you want, and you can use the full premier support from Oracle. Nexenta develops with the Illumos code, and the licence is TB-based; that is not my favourite. As well, the pool version from Nexenta lags behind. The current Nexenta Infiniband is not a usable solution. But everyone can use what they want; everyone must decide for themselves.

SRP targets or iSER are not difficult to configure. Use SRP for the storage-unit connection; Solaris and GlusterFS build one storage unit. The GlusterFS server exports the final volume to the clients, e.g. KVM, VMware, Hyper-V etc. You can use native GlusterFS, RDMA, NFS or CIFS to export the volume. SRP has nothing to do with VMware.

When you use a 7200 SAS drive, the access time is the same as a SATA drive; only the quality of the hardware is better. When you need performance you must use SAS drives with 15000 rpm. But that's not needed when you install SSDs for ZIL/L2ARC. ZeusRAM rocks :-)

I use dedup only on secondary storage or on the backup server, not on primary storage. When you use SSD and SATA drives you have a cheap and fast storage; a 1TB drive costs under $100. Currently I don't need to save storage volume.

Which application uses atime? I know "find -atime N". atime on a storage that stores only virtual disks? I don't need to know when I last opened the disk :-)

For a Solaris GUI take a look at napp-it: http://www.napp-it.org/

Greetings from Germany
Andreas








----- Original Message -----

From: "Fabrice Brazier"
To: cloudstack-users@incubator.apache.org
Sent: Tuesday, October 23, 2012 12:30:50
Subject: RE: Primary Storage

Hi Andreas,

Hum, that's pretty cool. I know they still have trouble with Infiniband on Nexenta, but it's clearly a priority on their roadmap (plus I trust the Nexenta team more than Oracle to evolve ZFS).

I agree iSCSI over Infiniband increases the latency, but most of the time it's just simpler to use IPoIB than native IB; for example, if you use VMware, the support of Infiniband is assured by Mellanox and not VMware, so if you have an issue the VMware support probably won't help you any more.

About the RAID, I'm a fan of RAID 10: I would rather build a RAID 10 with 7200 SAS drives than multiple raidz/raidz2 vdevs of 15k SAS drives, particularly for a virtual environment with a random workload and multiple access.

I'm a fan of NFS, so agreed about the ZIL, and ZeusRAM are the only ones for that (with a classical SSD you can hit the write hole).

Agreed about compression too (but only lzjb; gzip levels use too much CPU). Disabling atime reduces the IOPS load, but I'm not sure it is really a best practice. About deduplication I don't totally agree: it really depends on your array, the workload and the VM types on your CloudStack. Actually, when I build a ZFS array I count 1GB of RAM per 1TB of disk; with deduplication I count 2GB of RAM per 1TB of disk (plus in Nexenta 4 they will add the possibility to use SSD for the deduplication metadata, like the L2ARC for the read cache).

The last point is about your drives: why SATA drives? I mean, SATA doesn't respect its latency, and the reliability of a SAS drive is 10x that of SATA. Plus you can now find many NL-SAS drives at low cost.

But it's really a nice architecture; I have never tried GlusterFS (for the moment). Plus in that case it's really a good way to have a replacement for a metro-cluster for free. I tried one of its competitors (OneFS), and clearly clustered filesystems are the future.

Cheers,
Fabrice


-----Original Message-----
From: Andreas Huser [mailto:ahuser@7five-edv.de]
Sent: Tuesday, October 23, 2012 11:37
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage


Hi Fabrice,

I don't know what other people do, but I have no problems with Infiniband + GlusterFS + CloudStack. I don't use Nexenta; it's based on Illumos and does not work well with Infiniband. I have two different clusters in productive environments.

The first: Solaris 11 with built-in GlusterFS 3.3, exporting the Gluster volume with RDMA. Performance is okay; you can use that for smaller environments.

The second is a little more complex, with a GlusterFS server in the middle.

ZFS Server (based on Solaris 11):
1.) Create a zpool with min. two vdevs and SSD read/write cache
2.) Create a thin-provisioned volume ("zfs create -V"), disable atime and enable compression (do not enable dedup!), and export it as an (iWarp) SRP target to the GlusterFS server; use a direct connection without an IB switch.

GlusterFS Server:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use the OFED driver from https://www.openfabrics.org
3.) Import the SRP target from the ZFS server and format it as xfs
4.) Create a Gluster volume: "volume create xy transport rdma" (use only rdma)
5.) Connect the second IB port to an IB switch

Cloudstack Hypervisor Node:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use the OFED driver from https://www.openfabrics.org
3.) Import the Gluster volume

ZFS Thin Volume ---- Infiniband SRP ----> GlusterFS ---- GFSVol rdma ----> IB Switch ----> Clients

(A minimal command-level sketch of these steps follows below.)
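Here is that sketch; the pool, volume, brick, host and device names are placeholders, and the SRP target export step is only indicated, since the exact COMSTAR/SCST configuration isn't given in this thread.

  # On the Solaris 11 ZFS node (placeholder pool/device names):
  zpool create tank mirror c0t1d0 c0t2d0 mirror c0t3d0 c0t4d0 cache c0t5d0 log c0t6d0
  zfs create -s -V 2T -o compression=on tank/gfsvol01
  # ...then export tank/gfsvol01 as an SRP target over the direct IB link.

  # On the CentOS GlusterFS node, after the OFED drivers are installed:
  mkfs.xfs /dev/sdb                      # the imported SRP LUN
  mkdir -p /bricks/gfsvol01
  mount /dev/sdb /bricks/gfsvol01
  gluster volume create gfsvol01 transport rdma gfs01:/bricks/gfsvol01
  gluster volume start gfsvol01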


The ZFS and GlusterFS servers form a storage unit, connected directly with 40Gbit Infiniband point-to-point. You do not feel that there is a cable in between!

Important: when you have Infiniband, do not use IPoIB with iSCSI! If one already has Infiniband then one should also use its advantage: IPoIB has a higher latency than iWarp/SRP!

SRP latency is in the microsecond range:

-- SRP --
local address:  LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000
remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115 VAddr 0x000000088e6000
------------------------------------------------------------------
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000         1.29         125.30       1.31
------------------------------------------------------------------

-- IPoIB --
[root@sv01sfogaa ~]# ping 10.200.0.10
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms

When you put load on IPoIB, the latency increases; that is not good.

That is my recommendation for a simple GlusterFS mirror:

- Supermicro server with Intel hardware and expander backplane
- 1x Crucial M4 SSD read cache
- 2x ZeusIOPS SSD write cache (mirror)
- SATA 24/7 hard drives
- LSI HBA 9207 or 9211
- ConnectX-2 QDR dual-port Infiniband adapter (HP refurbished with full warranty for $100); important > flash the newest firmware from Mellanox!
- Mellanox IB switch
- Solaris 11
- GlusterFS 3.3 compiled with ib_verbs
- Gluster volume transport only rdma

Throughput is constant up to 200 Mbyte/s >> more throughput with more storage servers or more hard drives in a JBOD.

Info:

- I have had some problems with Infiniband RDMA or SRP on OpenIndiana, Illumos or Nexenta. Some adapters have high latency or an unstable connection. Use Solaris, that's the right way!
- OpenIndiana is beta! Infiniband ib_verbs does not work, or not well!
- Use Solaris 11; Infiniband ib_verbs is native and stable.
- Don't use an Ubuntu client or server for Infiniband! Use RedHat, Fedora or CentOS and install the right drivers from https://www.openfabrics.org/downloads/OFED/
- You have no SSD cache? Disable sync on the ZFS volume! Important! You lose security for your data, but some protocols use sync flags in transport. For example, NFS uses fsync by default: the write cache is not active and NFS writes data directly to the hard drive. For data security and performance, give the storage server an SSD write cache. ZFS works by default with sync=standard, which prevents write holes. (COW system)

I hope that I could help a little.

Greetings from Germany
Andreas



----- Original Message -----

From: "Fabrice Brazier"
To: cloudstack-users@incubator.apache.org
Sent: Tuesday, October 23, 2012 09:55:15
Subject: RE: Primary Storage


Hi Andreas,

Hello, I just saw your configuration; it seems quite interesting. If I understand well, you want to build some ZFS arrays on the backend and export LUNs (probably by iSCSI over Infiniband) to your Linux cluster, and on the Linux cluster you put GlusterFS. I can understand the point: with that you can have very good performance and reliability (ZFS), scalability and redundancy (Gluster), for a very low cost.

So just one question: did you try the global namespace implementation from Nexenta? If yes, can you tell me what configuration is the best for you? I mean, the fact that you have a Gluster cluster in the middle must impact the overall performance, no?

Fabrice


-----Original Message-----
From: Andreas Huser [mailto:ahuser@7five-edv.de]
Sent: Tuesday, October 23, 2012 05:40
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage


Hi,

for CloudStack I use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That gives the best performance and the most scalable storage. I have tested some different solutions for primary storage, but most are too expensive and not economic for a CloudStack cluster, or have poor performance.

My configuration:

Storage node: Supermicro server (Intel hardware) with Solaris 11, with SSD write and read cache (read: Crucial M4, write: ZeusIOPS), GlusterFS, and a dual-port ConnectX 40Gbit/s Infiniband adapter.

I have installed GlusterFS directly on Solaris with modified code. If you want to build bigger systems for more than 50 VMs, it is better to split Solaris and GlusterFS, with a separate headnode for GlusterFS.

That looks like:

Solaris ZFS backend storage with a dataset volume (thin provisioned) --> (SRP target attached directly, without an Infiniband switch, to the GF node) --> GlusterFS node: the SRP target formatted with the xfs filesystem, create a GlusterFS volume --> (Infiniband over a Mellanox port switch) --> CloudStack node mounts the GlusterFS volume over RDMA.

For the dataset volume on the ZFS storage, disable atime and enable compression. (Space reclaim.) With compression you can shrink the ZFS volume with a command on Linux (dd from /dev/zero) or in a Windows VM with sdelete. That gives you space back on the primary storage for deleted files in a VM, or for deleted vhd's or VMs in CloudStack.

greeting Andreas
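A minimal sketch of that space-reclaim trick (the pool/volume name and mount points are placeholders); with compression enabled, the zero-filled blocks compress to almost nothing and the zvol shrinks back:

  # On the ZFS side: compression must be on for the zeroes to reclaim space
  zfs set compression=on tank/gfsvol01

  # Inside a Linux guest: fill the free space with zeroes, then delete the file
  dd if=/dev/zero of=/zerofile bs=1M; rm -f /zerofile; sync

  # Inside a Windows guest the equivalent is:  sdelete -z C: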







Kind regards

Andreas Huser
Managing Director
System Engineer / Consultant
(Cisco CSE, SMBAM, LCSE, ASAM)
---------------------------------------
Zellerstraße 28 - 77654 Offenburg
Tel: +49(781) 12786898
Mobil: +49(176) 10308549
ahuser@7five-edv.de





----- Original Message -----

From: "Outback Dingo"
To: cloudstack-users@incubator.apache.org
Sent: Tuesday, October 23, 2012 02:15:16
Subject: Re: Primary Storage

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:

>> Solaris 11 ZFS, and yes we tried different setups, raid levels, number of SSD cache, ARC zfs options etc etc etc.
>>
>> Cheers
>>
>>> VMWare ??













--
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

