hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: Custom compaction
Date Thu, 27 May 2010 13:18:26 GMT
This is not currently on any road map as far as I know.  But I do think it's interesting nonetheless.

Piggybacking on compactions can be a good time to get some additional work done on your data
since we're already doing the work of reading and writing several HFiles.

One concern is compaction performance.  In HBase's architecture, overall performance can be
significantly impacted by slow-running compactions.

Another concern is that minor compactions do not always include all files of a region.  That
may limit what you can effectively do during a compaction, since you may not be seeing all
of the data.  Major compactions, however, always compact every file in a region.
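To make the minor/major distinction concrete, here is a toy Java sketch (not HBase code; every name in it is made up for illustration) of a compaction as a merge of sorted key-to-value files.  A minor compaction that skips a file can never see the keys that live only in that file:

```java
import java.util.*;

public class ToyCompaction {
    // A "file" is a sorted map of row key -> value. Files are passed oldest
    // first, so later (newer) files overwrite earlier ones: last write wins.
    static Map<String, String> compact(List<Map<String, String>> files) {
        Map<String, String> out = new TreeMap<>();
        for (Map<String, String> f : files) out.putAll(f);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> oldFile = new TreeMap<>(Map.of("row1", "v1", "row2", "v1"));
        Map<String, String> newFile = new TreeMap<>(Map.of("row1", "v2"));
        // Major compaction: merges all files, so it has a complete view.
        System.out.println(compact(List.of(oldFile, newFile))); // {row1=v2, row2=v1}
        // Minor compaction that skips oldFile never sees row2 at all.
        System.out.println(compact(List.of(newFile)));          // {row1=v2}
    }
}
```

Any per-row rewrite you do in a minor compaction therefore operates on a possibly incomplete set of versions.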

Friso, for your specific use case, is what you're trying to do to evict older versions of data?
 I had a little bit of trouble understanding your schema.  Or are you periodically taking
a bunch of versions of a column and combining them into a single version/value?  How many
of these versions are you adding for each column?  Is it really the case that read performance
is unacceptable if the data is spread across multiple versions?  One of the benefits of HBase
is that these versions are stored sequentially on disk, so reading multiple versions
(within reason) should not be significantly slower than reading one.

In any case, this is an interesting direction and I think it's worth exploring.  As for how
this would work, that I'm not so sure about.  Perhaps building on Andrew's work with Coprocessors,
RegionObservers, etc...
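As for the shape such a hook might take: the sketch below is plain Java with no real HBase APIs (`CompactionObserver` and everything else here is hypothetical), showing the idea of letting user code rewrite a cell's versions as the compactor streams them through:

```java
import java.util.*;

public class HookedCompaction {
    // Hypothetical hook: given all versions of one cell (newest first),
    // return the versions that should survive the compaction.
    interface CompactionObserver {
        List<String> rewrite(String rowKey, List<String> versions);
    }

    // Toy compactor: stream each cell's versions through the observer
    // and write out whatever the observer returns.
    static Map<String, List<String>> compact(Map<String, List<String>> cells,
                                             CompactionObserver observer) {
        Map<String, List<String>> out = new TreeMap<>();
        for (var e : cells.entrySet())
            out.put(e.getKey(), observer.rewrite(e.getKey(), e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cells = Map.of("row1", List.of("v3", "v2", "v1"));
        // An observer that keeps only the newest version, i.e. evicts old ones.
        CompactionObserver keepNewest = (row, versions) -> List.of(versions.get(0));
        System.out.println(compact(cells, keepNewest)); // {row1=[v3]}
    }
}
```

The performance caveat above applies directly: whatever `rewrite` does runs on the compaction's critical path.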


> -----Original Message-----
> From: Friso van Vollenhoven [mailto:fvanvollenhoven@xebia.com]
> Sent: Thursday, May 27, 2010 1:34 AM
> To: user@hbase.apache.org
> Subject: Re: Custom compaction
> Hi,
>
> Actually, for us it would be nice to be able to hook into the
> compaction, too.
>
> We store records that are basically events that occur at certain times.
> We store the record itself as qualifier and a timeline as column value
> (so multiple records+timelines per row key is possible). So when a new
> record comes in, we do a get for the timeline, merge the new timestamp
> with the existing timeline in memory and do a put to update the column
> value with the new timeline.
>
> In our first version, we just wrote the individual timestamps as values
> and used versioning to keep all timestamps in the value. Then we
> combined all the timelines and individual timestamps into a single
> timeline in memory on each read. We ran a MR job periodically to do the
> timeline combining in the table and delete the obsolete timestamps in
> order to keep read performance OK (because otherwise the read operation
> would involve a lot of additional work to create a timeline and lots of
> versions would be created). In the end, the deletes in the MR job were
> a bottleneck (as I understand it, but I was not on the project at that
> time).
>
> Now, if we could hook into the compactions, then we could just always
> insert individual timestamps as new versions and do the combining of
> versions into a single timeline during compaction (as compaction needs
> to go through the complete table anyway). This would also improve our
> insertion performance (no more gets in there, just puts like in the
> first version), which is nice. We collect internet routing information,
> which arrives at 80 million records per day, with updates coming in
> batches every 5 minutes (http://ris.ripe.net). We'd like to try to
> be efficient before just throwing more machines at the problem.
>
> Will there be anything like this on the roadmap?
>
> Cheers,
> Friso
> On May 27, 2010, at 1:01 AM, Jean-Daniel Cryans wrote:
> > Invisible. What's your need?
> >
> > J-D
> >
> > On Wed, May 26, 2010 at 3:56 PM, Vidhyashankar Venkataraman
> > <vidhyash@yahoo-inc.com> wrote:
> >> Is there a way to customize the compaction function (like a hook
> provided by the API) or is it invisible to the user?
> >>
> >> Thank you
> >> Vidhya
> >>
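The combining step Friso describes is simple to state on its own: take the loose single-timestamp versions plus any already-combined timeline and fold them into one sorted timeline value.  A standalone sketch (no HBase; the comma-separated serialization is invented purely for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class TimelineMerge {
    // Combine already-merged timelines (comma-separated timestamps) and
    // loose single-timestamp versions into one sorted, de-duplicated timeline.
    static String combine(List<String> versions) {
        return versions.stream()
                .flatMap(v -> Arrays.stream(v.split(",")))
                .map(String::trim)
                .mapToLong(Long::parseLong)
                .distinct()
                .sorted()
                .mapToObj(Long::toString)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Versions as HBase would hand them back: newest first,
        // with the oldest version being a previously combined timeline.
        List<String> versions = List.of("400", "300", "100,200,300");
        System.out.println(combine(versions)); // 100,200,300,400
    }
}
```

Running this fold during compaction, rather than in a separate MR job with explicit deletes, is exactly the hook being discussed: the obsolete versions simply don't get rewritten.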
