hadoop-common-dev mailing list archives

From "Billy Pearson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2615) Add max number of mapfiles to compact at one time giveing us a minor & major compaction
Date Fri, 25 Jan 2008 06:28:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562374#action_12562374 ]

Billy Pearson commented on HADOOP-2615:
---------------------------------------

That's what I see too: the split never happens while a region is under a load of inserts. I still
think that if we are going to get transaction speed close to Bigtable's, we will need to add a
limit on the number of map files to compact at one time.
Even if HADOOP-2636 gets the flushing working right from a performance point of view, I think this
should be included anyway as a way to handle a large number of regions per server.
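To make the idea concrete, here is a minimal sketch of that kind of cap on a compaction pass. The MapFileInfo type, its field names, and the maxFiles parameter are all hypothetical, just for illustration; this is not the actual HStore code.

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for whatever describes an on-disk mapfile.
class MapFileInfo {
  long sizeBytes;   // size of the mapfile on disk
  long flushTime;   // when it was flushed from memcache
}

class CappedCompactionPicker {
  // Pick at most maxFiles of the newest mapfiles for a quick "minor" pass,
  // leaving older/larger files for a later full (major) compaction.
  static List<MapFileInfo> pick(List<MapFileInfo> files, int maxFiles) {
    List<MapFileInfo> sorted = new ArrayList<MapFileInfo>(files);
    Collections.sort(sorted, new Comparator<MapFileInfo>() {
      public int compare(MapFileInfo a, MapFileInfo b) {
        return Long.compare(b.flushTime, a.flushTime); // newest first
      }
    });
    return new ArrayList<MapFileInfo>(
        sorted.subList(0, Math.min(maxFiles, sorted.size())));
  }
}
{code}

When updates to a region slack off, the same cap eventually covers every mapfile in it, so the pass turns into a full compaction without any extra machinery.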

I am seeing 10-15 mins to run a compaction on a 90MB region using block compression, and most
people will want to handle more than 25-50 regions per server.
Say the average region server holds 100 regions: that works out to 100 * 10 mins = 1000 mins,
or roughly 16-17 hours, to run a full compaction on all the regions.
With this limit in place, the map files on regions getting heavy update traffic will not get
out of control.

100 regions with a 90MB average size is only about 9GB of compressed data.
Closer to a production release I would like to see a better compression method used; this would
help with compaction speed, since right now my bottleneck during compaction is compression.

{New Idea}
After thinking on this a little, I am not sure that triggering a compaction on the number of map
files is the best way to go.
Compacting 3-6 small 1-2MB map files does not take that long even with compression, so the ideal
way to do this would be to only compact small files while we have small files to compact, leaving
the larger map files to be compacted later when load is not as high.

Bigtable has the right idea: only do a full/major compaction of all the map files every so often,
to remove deleted data and data outside its max version range.
So we might want to look at replacing the trigger based on the number of map files with a limit
on the size of the map files. For example, say a region family has a max compaction size of 16MB:
we would only compact files under that size, and once a compacted file grows past the max
compaction size we would not include it in the next compaction. That would leave map files of
roughly the same size to be compacted together, say once a day and/or after splits.
I would also like to keep the region servers handling compaction on their own, so the master
can be left alone to do other, more important tasks.
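Roughly what I have in mind, again only as a sketch reusing the illustrative MapFileInfo from the sketch above; the 16MB constant and the class name are made up:

{code}
import java.util.ArrayList;
import java.util.List;

class SizeBasedCompactionPicker {
  // Hypothetical per-family limit; anything this large or larger waits for the
  // periodic full/major compaction (say once a day and/or after splits).
  static final long MAX_COMPACTION_SIZE = 16L * 1024 * 1024;

  static List<MapFileInfo> pick(List<MapFileInfo> files) {
    List<MapFileInfo> small = new ArrayList<MapFileInfo>();
    for (MapFileInfo f : files) {
      if (f.sizeBytes < MAX_COMPACTION_SIZE) {
        small.add(f); // small flushes are cheap to merge, even with compression
      }
      // larger mapfiles are skipped until the next major compaction
    }
    return small;
  }
}
{code}

The region server could run this selection on its own schedule, which keeps the master out of the loop.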

Currently, if you load a region server with many regions, it will always be running compactions
on whatever regions are getting data inserted.
This change would lessen the load on the hard drives, memory, and CPUs, giving more resources for
faster/more transactions.

> Add max number of mapfiles to compact at one time giveing us a minor & major compaction
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2615
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2615
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hbase
>            Reporter: Billy Pearson
>            Priority: Minor
>             Fix For: 0.17.0
>
>         Attachments: flag.patch, twice.patch
>
>
> Currently we do compaction on a region when the hbase.hstore.compactionThreshold is reached
- default 3.
> I think we should configure a max number of mapfiles to compact at one time, similar to doing
a minor compaction in Bigtable. This keeps compactions from getting tied up in one region too
long, letting other regions pile up way too many memcache flushes and making compaction take
longer and longer for each region.
> If we did that, then as a region's updates start to slack off, the max number will eventually
include all of its mapfiles, causing a major compaction on that region. Unlike Bigtable, this
would leave the master out of the process, letting the region server handle the major compaction
when it has time.
> When doing a minor compaction on a few files, I think we should compact the newest mapfiles
first and leave the larger/older ones for when a region has few updates.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

