hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase compaction question
Date Fri, 21 Feb 2014 23:47:22 GMT
bq. job A that read from HBase and it takes about 30 minutes

In your current practice, do you observe an increase in the duration of job A?
That would indicate whether the minor compactions have reduced the number of
HFiles to an acceptable level.

You should take a look at http://hbase.apache.org/book.html#hbase_metrics,
especially the NumberOfStorefiles metric.
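
One way to turn the check above into a number is to fit a straight line to
hourly readings, either the duration of job A or a store file count, and look
at the slope. The following is only a minimal sketch: the function name and
the sample readings are hypothetical, and how the readings are collected is
left out.

```python
def hourly_trend(samples):
    """Least-squares slope of readings taken one hour apart (units per hour)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical store file counts for one region, not data from this thread.
steady = [12, 13, 12, 13, 12, 13]    # compaction is keeping up
growing = [12, 13, 14, 15, 16, 17]   # one extra file per hour, never compacted

print(hourly_trend(steady))
print(hourly_trend(growing))
```

A slope that stays near zero over several days suggests minor compactions are
keeping up; a slope that stays positive suggests the backlog is growing.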


On Fri, Feb 21, 2014 at 10:55 AM, Chen Song <chen.song.82@gmail.com> wrote:

> Below is a brief description of our use case:
>     * We use Cloudera CDH4
>     * Our cluster has ~85 nodes and it stores one big table called "imps"
>     * imps table is pre-split into 5000 regions, hence each node (region
> server) has about 60 regions.
>     * Each hour, we bulk load one hour's worth of data, about 40G, with
> partitions generated based on region splits. We set TTL to 33 days so the
> total amount of data stored for that table is 33 * 40 * 24 = ~31T, assuming
> major compaction works properly. Another implication of this process is
> that every hour, each region's HFile count increases by 1.
>     * We have automatic major compaction disabled as suggested.
>     * There is an hourly job A that read from HBase and it takes about 30
> minutes, so we can't really afford hours of HBase downtime.
>     * The loading time + job A running time is about 45 minutes. So
> effectively there are 15 minutes each hour when HBase is not being used.
> The problem we have run into is with compaction.
>     * The first thing we tried was to explicitly schedule major compaction
> for the top 5-10 regions (those with the most HFiles) per region server.
> This was done every hour, and the idea was to use the 15 idle minutes to
> compact the heaviest regions, in the hope of cycling through all regions
> on each RS in 6-12 hours. However, there were some problems:
>         * On CDH4, HBase major compaction can only be scheduled
> asynchronously and there is no way to track it.
>         * It used to work fine, but as the data grew, the asynchronous
> major compaction took more and more time.
>         * Because of the above 2 facts, we saw the compaction queue pile up
> and compaction never caught up.
>     * Then we disabled the first option and resorted to automatic minor
> compaction. We let HBase manage itself and it works so far. However, our
> concern is still there: as data grows, will it run into the same problem
> as the first option?
> Let me know if you need further clarification or have any questions. Thank
> you very much in advance.
> Best,
> --
> Chen Song
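
The sizing figures in the quoted message, and the "heaviest regions first"
schedule it describes, can be checked with a short calculation. This is only
a sketch under stated assumptions: the constants are the numbers quoted above,
while the function names and the sample HFile counts are hypothetical and do
not come from any HBase API.

```python
import heapq
import math

# Figures quoted in the message above.
NODES = 85
REGIONS = 5000
GB_PER_HOUR = 40
TTL_DAYS = 33

regions_per_server = REGIONS / NODES        # about 59, the "about 60" above
retained_gb = TTL_DAYS * 24 * GB_PER_HOUR   # 33 * 40 * 24 GB, the "~31T" above
# One bulk-loaded HFile per region per hour: the worst case if nothing
# were ever compacted within the TTL window.
max_hfiles_per_region = TTL_DAYS * 24


def hours_to_cycle(regions_on_server, regions_per_hour):
    """Hours needed to compact every region once at a fixed hourly batch size."""
    return math.ceil(regions_on_server / regions_per_hour)


def heaviest_regions(hfile_counts, batch_size):
    """Return the batch_size region names with the most HFiles, heaviest first.

    hfile_counts is a hypothetical {region_name: hfile_count} mapping for one
    region server; how it is collected is outside this sketch.
    """
    return heapq.nlargest(batch_size, hfile_counts, key=hfile_counts.get)


# Hypothetical HFile counts for four regions on one region server.
counts = {"r1": 30, "r2": 12, "r3": 45, "r4": 7}
print(heaviest_regions(counts, 2))                    # ['r3', 'r1']
print(hours_to_cycle(60, 10), hours_to_cycle(60, 5))  # 6 12
```

The cycle estimate only holds if each hourly batch finishes inside the
15-minute idle window, which is the assumption that stopped holding as the
data grew.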
