hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Optimizing compactions on super-low-cost HW
Date Mon, 25 May 2015 04:58:28 GMT
Re: blockingStoreFiles
With LSM stores you do not get smooth behavior when you continuously try to pump more data
into the cluster than the system can absorb.
For a while the memstores can absorb the writes in RAM; then they need to flush. If compactions
cannot keep up with the influx of new HFiles, you have two choices: (1) you allow the number
of HFiles to grow at the expense of read performance, or (2) you tell the clients to slow
down (there are various levels of sophistication in how you do that, but that's beside
the point).
blockingStoreFiles is the maximum number of files (per store, i.e. per column family) that
HBase will allow to accumulate before it stops accepting writes from the clients. In 0.94 it
would simply block the writes for a while. In 0.98 it throws an exception back to the client
to tell it to back off.
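For reference, this knob lives in hbase-site.xml. A minimal sketch (the values shown are
illustrative, not recommendations, and the defaults differ between releases):

    <!-- hbase-site.xml: illustrative values only -->
    <property>
      <!-- files per store before HBase blocks (0.94) / rejects (0.98) writes -->
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>20</value>
    </property>
    <property>
      <!-- how long a write may stay blocked waiting for compactions to catch up -->
      <name>hbase.hstore.blockingWaitTime</name>
      <value>90000</value>
    </property>

Raising it trades read latency (more files to check per read) for fewer write stalls.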
-- Lars

     From: Serega Sheypak <serega.sheypak@gmail.com>
 To: user <user@hbase.apache.org>; lars hofhansl <larsh@apache.org> 
 Sent: Sunday, May 24, 2015 12:59 PM
 Subject: Re: Optimizing compactions on super-low-cost HW
   
Hi, thanks!
> hbase.hstore.blockingStoreFiles
I don't understand the idea of this setting; where can I find an explanation for
"dummies"?

> hbase.hregion.majorcompaction
done already

> DATA_BLOCK_ENCODING, SNAPPY
I always use them by default; CPU is OK

> memstore flush size
done


> I assume only the 300g partitions are mirrored, right? (not the entire 2t drive)
Aha, yes, only the 300GB partitions are mirrored.

> Can you add more machines?
Will do it when we earn money.
Thank you :)



2015-05-24 21:42 GMT+03:00 lars hofhansl <larsh@apache.org>:

> Yeah, all you can do is drive your write amplification down.
>
>
> As Stack said:
> - Increase hbase.hstore.compactionThreshold and
> hbase.hstore.blockingStoreFiles. It'll hurt reads, but in your case reads are
> already significantly hurt when compactions happen.
>
>
> - Absolutely set hbase.hregion.majorcompaction to 1 week (with a jitter of
> 1/2 week; that's the default in 0.98 and later). Minor compactions will
> still happen, based on the compactionThreshold setting. Right now you're
> rewriting _all_ your data _every_ day.
>
>
> - Turning off WAL writing will save you IO, but I doubt it'll help much. I
> do not expect async WAL to help a lot, as the aggregate IO is still the same.
>
> - See if you can enable DATA_BLOCK_ENCODING on your column families
> (FAST_DIFF or PREFIX are good). You can also try SNAPPY compression. That
> would reduce your overall IO. (Since your CPUs are also weak, you'd have to
> test the CPU/IO tradeoff.)
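A sketch of what enabling these looks like through the 0.98-era Java admin API; the table
and family names are placeholders, and the change only takes effect as files get rewritten
by flushes and compactions:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EnableEncodingSketch {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        TableName table = TableName.valueOf("mytable");             // placeholder
        HTableDescriptor desc = admin.getTableDescriptor(table);
        HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("cf")); // placeholder family
        cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);       // delta-encode keys within blocks
        cf.setCompressionType(Compression.Algorithm.SNAPPY);        // cheap CPU, real IO savings
        admin.modifyColumn(table, cf);
        admin.close();
      }
    }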
>
>
> - If you have RAM to spare, increase the memstore flush size (will lead to
> initially larger and fewer files).
>
>
> - Or (again if you have spare RAM) make your regions smaller, to curb
> write amplification.
>
>
> - I assume only the 300g partitions are mirrored, right? (not the entire
> 2t drive)
>
>
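Taken together, the server-side knobs above map onto a handful of hbase-site.xml
properties. A hedged sketch with illustrative values (defaults vary between 0.94 and
0.98, so check your release; DATA_BLOCK_ENCODING and SNAPPY are per-column-family
attributes set through the admin API or shell, not here):

    <!-- hbase-site.xml: illustrative values only, not recommendations -->
    <property>
      <name>hbase.hstore.compactionThreshold</name>
      <value>8</value>            <!-- let more files accumulate before a minor compaction -->
    </property>
    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>20</value>           <!-- raise the write-blocking ceiling to match -->
    </property>
    <property>
      <name>hbase.hregion.majorcompaction</name>
      <value>604800000</value>    <!-- 1 week, in ms -->
    </property>
    <property>
      <name>hbase.hregion.majorcompaction.jitter</name>
      <value>0.5</value>          <!-- spread majors across +/- half the period -->
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>268435456</value>    <!-- 256MB flushes, if RAM allows: fewer, larger files -->
    </property>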
> I have some suggestions compiled here (if you don't mind the plug):
> http://hadoop-hbase.blogspot.com/2015/05/my-hbasecon-talk-about-hbase.html
>
> Other than that, I'll repeat what others said: you have 14 extremely weak
> machines; you can't expect the world from this.
> Your aggregate IOPS are less than 3000, your aggregate IO bandwidth
> ~3GB/s. Can you add more machines?
>
>
> -- Lars
>
> ________________________________
> From: Serega Sheypak <serega.sheypak@gmail.com>
> To: user <user@hbase.apache.org>
> Sent: Friday, May 22, 2015 3:45 AM
> Subject: Re: Optimizing compactions on super-low-cost HW
>
>
> We don't have money; these nodes are the cheapest. I totally agree that we
> need 4-6 HDDs, but unfortunately there is no chance of getting them.
> Okay, I'll try to apply Stack's suggestions.
>
>
>
>
> 2015-05-22 13:00 GMT+03:00 Michael Segel <michael_segel@hotmail.com>:
>
> > Look, to be blunt, you’re screwed.
> >
> > If I read your cluster spec… it sounds like you have a single i7 (quad
> > core) CPU. That's 4 cores or 8 threads.
> >
> > Mirroring the OS is common practice.
> > Using the same drives for Hadoop… not so good, but once the server boots
> > up… not so much I/O.
> > It's not good, but you could live with it….
> >
> > Your best bet is to add a couple more spindles. Ideally you'd want to
> > have 6 drives: the 2 OS drives mirrored and separate (use the extra space
> > to stash / write logs), then 4 drives / spindles in JBOD for Hadoop.
> > This brings you to a 1:1 ratio of spindles to physical cores. If your box
> > can handle more spindles, then going to a total of 10 drives would improve
> > performance further.
> >
> > However, you need to level-set your expectations… you can only go so far.
> > If you have 4 drives spinning, you could start to saturate a 1GbE network,
> > and that will hurt performance.
> >
> > That’s pretty much your only option in terms of fixing the hardware and
> > then you have to start tuning.
> >
> > > On May 21, 2015, at 4:04 PM, Stack <stack@duboce.net> wrote:
> > >
> > > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
> > >
> > >>> Do you have the system sharing
> > >> There are 2 HDDs, 7200 RPM, 2TB each. There is a 300GB OS partition on each drive,
> > >> with mirroring enabled. I can't persuade devops that mirroring could cause
> > >> IO issues. What arguments can I bring? They use OS partition mirroring so that when
> > >> a disk fails, we can use the other partition to boot the OS and continue to work...
> > >>
> > >>
> > > You are already compromised i/o-wise having two disks only. I don't have the
> > > experience to say for sure, but basic physics would seem to dictate that
> > > having your two disks (partially) mirrored compromises your i/o even more.
> > >
> > > You are in a bit of a hard place. Your operators want the machine to boot
> > > even after it loses 50% of its disks.
> > >
> > >
> > >>> Do you have to compact? In other words, do you have read SLAs?
> > >> Unfortunately, I have a mixed workload from web applications. I need to write
> > >> and read, and the SLA is < 50ms.
> > >>
> > > Ok. You get the bit that seeks are about 10ms each, so with two disks you
> > > can do 2x100 seeks a second, presuming no one else is using the disks.
> > >
> > >>> How are your read times currently?
> > >> Cloudera Manager says it's 4K reads per second and 500 writes per second.
> > >>
> > >>> Does your working dataset fit in RAM or do reads have to go to disk?
> > >> I have several tables of 500GB each and many small tables of 10-20 GB. Small
> > >> tables are loaded hourly/daily using bulkload (we prepare HFiles using MR and move
> > >> them to HBase using the utility). Big tables are used by webapps, which read and
> > >> write them.
> > >>
> > >>
> > > These hfiles are created on the same cluster with MR? (i.e. they are using up
> > > i/os)
> > >
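For context, the bulkload hand-off described above is presumably the completebulkload
utility (LoadIncrementalHFiles). A minimal sketch with the 0.98-era API; the HFile path
and table name are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "small_table");     // placeholder table
        // Move the MR-prepared HFiles into the table's region directories.
        new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
        table.close();
      }
    }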
> > >
> > >>> It looks like you are running at about three storefiles per column family.
> > >> Is it hbase.hstore.compactionThreshold=3?
> > >>
> > >
> > >
> > >>> What if you upped the threshold at which minors run?
> > >> You mean bump hbase.hstore.compactionThreshold to 8 or 10?
> > >>
> > >>
> > > Yes.
> > >
> > > Downside is that your reads may require more seeks to find a keyvalue.
> > >
> > > Can you cache more?
> > >
> > > Can you make it so files are bigger before you flush?
> > >
> > >
> > >
> > >>> Do you have a downtime during which you could schedule compactions?
> > >> Unfortunately no. It has to work 24/7, and sometimes it fails to.
> > >>
> > >>
> > > So, it is running at full bore 24/7?  There is no 'downtime'... a time when
> > > the traffic is not so heavy?
> > >
> > >
> > >
> > >>> Are you managing the major compactions yourself or are you having hbase do
> > >>> it for you?
> > >> HBase, once a day: hbase.hregion.majorcompaction=1day
> > >>
> > >>
> > > Have you studied your compactions?  You realize that a major compaction
> > > will do a full rewrite of your dataset?  When they run, how many storefiles
> > > are there?
> > >
> > > Do you have to run once a day?  Can you not run once a week?  Can you
> > > manage the compactions yourself... and run them a region at a time in a
> > > rolling manner across the cluster rather than have them just run whenever
> > > it suits them once a day?
> > >
> > >
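One way to run majors yourself, a region at a time, is a small driver against the admin
API. A sketch (0.98-era API; the table name and pacing are placeholders, and real code
should poll the region's compaction state instead of sleeping blindly):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class RollingMajorCompact {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        // Compact one region at a time, pausing between regions to cap the IO hit.
        for (HRegionInfo region : admin.getTableRegions(TableName.valueOf("big_table"))) {
          admin.majorCompact(region.getRegionName());
          Thread.sleep(5 * 60 * 1000L);   // crude pacing placeholder
        }
        admin.close();
      }
    }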
> > >
> > >> I can disable WAL. It's ok to lose some data in case of RS failure. I'm
> > >> not doing banking transactions.
> > >> If I disable WAL, could it help?
> > >>
> > >>
> > > It could, but don't. Enable deferred sync'ing first if you can 'lose' some
> > > data.
> > >
> > > Work on your flushing and compactions before you mess w/ WAL.
> > >
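For what it's worth, deferred sync'ing means ASYNC_WAL durability (WAL edits are still
written, but syncs are batched) rather than skipping the WAL entirely; you risk only the
last unsynced edits on a crash. A sketch with the 0.98-era API, table name a placeholder
(on 0.94 the equivalent is HTableDescriptor.setDeferredLogFlush(true)):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class DeferredSyncSketch {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        TableName table = TableName.valueOf("big_table");   // placeholder
        HTableDescriptor desc = admin.getTableDescriptor(table);
        desc.setDurability(Durability.ASYNC_WAL);  // group WAL syncs; small loss window on crash
        admin.disableTable(table);                 // schema change may need the table offline
        admin.modifyTable(table, desc);
        admin.enableTable(table);
        admin.close();
      }
    }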
> > > What version of hbase are you on? You say CDH, but the newer your hbase, the
> > > better it does generally.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > >
> > >> 2015-05-20 18:04 GMT+03:00 Stack <stack@duboce.net>:
> > >>
> > >>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
> > >>>
> > >>>> Hi, we are using extremely cheap HW:
> > >>>> 2 HDD 7200 RPM
> > >>>> 4*2 core (Hyperthreading)
> > >>>> 32GB RAM
> > >>>>
> > >>>> We met serious IO performance issues.
> > >>>> We have a more or less even distribution of read/write requests. The same
> > >>>> for data size.
> > >>>>
> > >>>> ServerName                               Req/sec  Read Req Count  Write Req Count
> > >>>> node01.domain.com,60020,1430172017193    195      171871826       16761699
> > >>>> node02.domain.com,60020,1426925053570    24       34314930        16006603
> > >>>> node03.domain.com,60020,1430860939797    22       32054801        16913299
> > >>>> node04.domain.com,60020,1431975656065    33       1765121         253405
> > >>>> node05.domain.com,60020,1430484646409    27       42248883        16406280
> > >>>> node07.domain.com,60020,1426776403757    27       36324492        16299432
> > >>>> node08.domain.com,60020,1426775898757    26       38507165        13582109
> > >>>> node09.domain.com,60020,1430440612531    27       34360873        15080194
> > >>>> node11.domain.com,60020,1431989669340    28       44307           13466
> > >>>> node12.domain.com,60020,1431927604238    30       5318096         2020855
> > >>>> node13.domain.com,60020,1431372874221    29       31764957        15843688
> > >>>> node14.domain.com,60020,1429640630771    41       36300097        13049801
> > >>>>
> > >>>> ServerName                               Stores  Storefiles  Storefile Size  Uncompressed  Index Size  Bloom Size
> > >>>> node01.domain.com,60020,1430172017193    82      186         1052080m        76496mb       641849k     310111k
> > >>>> node02.domain.com,60020,1426925053570    82      179         1062730m        79713mb       649610k     318854k
> > >>>> node03.domain.com,60020,1430860939797    82      179         1036597m        76199mb       627346k     307136k
> > >>>> node04.domain.com,60020,1431975656065    82      400         1034624m        76405mb       655954k     289316k
> > >>>> node05.domain.com,60020,1430484646409    82      185         1111807m        81474mb       688136k     334127k
> > >>>> node07.domain.com,60020,1426776403757    82      164         1023217m        74830mb       631774k     296169k
> > >>>> node08.domain.com,60020,1426775898757    81      171         1086446m        79933mb       681486k     312325k
> > >>>> node09.domain.com,60020,1430440612531    81      160         1073852m        77874mb       658924k     309734k
> > >>>> node11.domain.com,60020,1431989669340    81      166         1006322m        75652mb       664753k     264081k
> > >>>> node12.domain.com,60020,1431927604238    82      188         1050229m        75140mb       652970k     304137k
> > >>>> node13.domain.com,60020,1431372874221    82      178         937557m         70042mb       601684k     257607k
> > >>>> node14.domain.com,60020,1429640630771    82      145         949090m         69749mb       592812k     266677k
> > >>>>
> > >>>>
> > >>>> When compaction starts, a random node gets 100% I/O utilization, with iowait of
> > >>>> seconds, even tens of seconds.
> > >>>>
> > >>>> What are the approaches to optimizing minor and major compactions when you
> > >>>> are I/O bound..?
> > >>>>
> > >>>
> > >>> Yeah, with two disks, you will be crimped. Do you have the system sharing
> > >>> with hbase/hdfs or is hdfs running on one disk only?
> > >>>
> > >>> Do you have to compact? In other words, do you have read SLAs?  How are
> > >>> your read times currently?  Does your working dataset fit in RAM or do
> > >>> reads have to go to disk?  It looks like you are running at about three
> > >>> storefiles per column family.  What if you upped the threshold at which
> > >>> minors run? Do you have a downtime during which you could schedule
> > >>> compactions? Are you managing the major compactions yourself or are you
> > >>> having hbase do it for you?
> > >>>
> > >>> St.Ack
> > >>>
> > >>
> >
> >
>

  