hbase-user mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: When does compaction actually occur?
Date Wed, 06 Jun 2012 09:06:10 GMT
Hi Tom,

You have set MIN_VERSIONS to 1. That tells HBase that for this column family you want to keep
at least 1 version of a cell around regardless of whether it expired (due to TTL) or not.
I think if you remove that it will behave as you expect.
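For example, dropping it back to the default could look something like this in the HBase shell (syntax from memory, so double-check against your version; on 0.92 you may need to disable the table before altering it):

```
disable 'facts'
alter 'facts', {NAME => 'd', MIN_VERSIONS => '0'}
enable 'facts'
```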

As a general rule a compaction will never influence visibility of data that was inserted before
the compaction (except for RAW scans), and hence you should never need to ask when a compaction
happens - unless you are running out of disk space.
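To make the TTL/MIN_VERSIONS interaction concrete, here is a toy model in plain Python (not HBase source; the function and its shape are invented for illustration) of which versions of a single cell a major compaction would keep:

```python
def surviving_versions(timestamps, now, ttl, min_versions, max_versions):
    """Toy model (not HBase code) of which versions of one cell
    survive a major compaction under TTL + MIN_VERSIONS + VERSIONS."""
    kept = []
    # HBase orders versions newest-first.
    for ts in sorted(timestamps, reverse=True):
        if len(kept) >= max_versions:
            break                      # over the VERSIONS limit
        expired = (now - ts) > ttl
        # An expired version is retained anyway while fewer than
        # MIN_VERSIONS versions have been kept so far.
        if not expired or len(kept) < min_versions:
            kept.append(ts)
    return kept

# With MIN_VERSIONS=1, an expired cell still survives compaction:
print(surviving_versions([2000], now=10000, ttl=3600,
                         min_versions=1, max_versions=1))   # [2000]
# With MIN_VERSIONS=0 (the default), it is dropped:
print(surviving_versions([2000], now=10000, ttl=3600,
                         min_versions=0, max_versions=1))   # []
```

This is exactly the symptom in the original question: the newest version of each cell is kept regardless of its age, because MIN_VERSIONS guarantees it a slot.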

-- Lars
From: Tom Brown <tombrown52@gmail.com>
To: user@hbase.apache.org 
Sent: Tuesday, June 5, 2012 2:37 PM
Subject: Re: When does compaction actually occur?


In response to your earlier email, I'm not completely sure whether or
not I'm using a raw scan. The scan is performed in a region server
coprocessor initialized as such:

        Scan scan = new Scan()
            .setTimeRange(myMinTimeStamp, myMaxTimeStamp);

        InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment())
            .getRegion().getScanner(scan);

The scan is indeed being filtered to the range I provide (using
setTimeRange), but it will retrieve records much older than should be
allowed given the TTL.

I have multiple tables set up in a similar fashion; here is a
description of one of them:

{NAME => 'facts', FAMILIES => [{NAME => 'd', BLOOMFILTER => 'ROW',
TTL => '3600', MIN_VERSIONS => '1'}]}

I'm building an OLAP cube for this project and want to make sure the
data size doesn't grow through the roof. Whether or not data expires
after exactly one hour is not an absolute requirement for this use
case. But I want to know why the system is not behaving as I think I
configured it to behave.



On Sun, Jun 3, 2012 at 2:57 AM, Lars George <lars.george@gmail.com> wrote:
> What Amandeep says, and also keep in mind that with the current selection process HBase
holds O(log N) files for N data. So for 2GB region sizes you get 2-3 files. This means
it compacts files very "aggressively", and most of these compactions are "all files included"
ones, which are then promoted to major compactions implicitly. That way your predicate deletes
should be in effect and you will only need scheduled major compactions every so often.
> Lars
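The size-ratio selection Lars describes can be sketched as follows. This is a much-simplified toy (plain Python, invented names; the real 0.92 algorithm and its default ratio differ in detail), but it shows why the typical selection sweeps up every file and is implicitly promoted to a major compaction:

```python
def select_for_compaction(file_sizes, ratio=1.2):
    """Simplified sketch of size-ratio compaction selection: walking
    from the oldest file, include it (and everything newer) once its
    size is at most `ratio` times the total size of all newer files."""
    for i, size in enumerate(file_sizes):      # oldest first
        if size <= ratio * sum(file_sizes[i + 1:]):
            return file_sizes[i:]              # compact these together
    return []                                  # nothing selected

# Typical steady state: a few similarly sized files, so the selection
# includes all of them, i.e. it is effectively a major compaction:
print(select_for_compaction([8, 4, 2, 1]))    # [8, 4, 2, 1]
# One huge old file is skipped; only the newer files are compacted,
# so its expired cells are NOT cleaned up by this minor compaction:
print(select_for_compaction([100, 2, 1, 1]))  # [2, 1, 1]
```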
> On Jun 2, 2012, at 1:04 AM, Amandeep Khurana wrote:
>> Tom,
>> Old cells will get deleted as part of the next major compaction, which is typically
recommended to be done once a day, when the load on the system is at its lowest.
>> FWIW… To have a TTL of 3600 take effect, you'd have to run a major compaction every
hour, which is an expensive operation, especially at scale. Chances are your I/O load
will shoot up and latencies will spike for operations to HBase. Can you tell us why a TTL
of 3600s is of interest? What are your access patterns?
>> -Amandeep
>> On Friday, June 1, 2012 at 3:59 PM, Tom Brown wrote:
>>> I have a table that holds rotating data. It has a TTL of 3600. For
>>> some reason, when I scan the table I still get old cells that are much
>>> older than that TTL.
>>> I have tried issuing a compaction request via the web UI, but that
>>> didn't seem to do anything.
>>> Am I misunderstanding the data model used by HBase? Is there anything
>>> else I can check to verify the functionality of my integration?
>>> I am using HBase 0.92 with Hadoop 1.0.2.
>>> Thanks in advance!
>>> --Tom
