crunch-dev mailing list archives

From Gabriel Reid
Subject Re: Loading HFiles via Crunch / Job output post processing steps
Date Wed, 03 Oct 2012 04:33:26 GMT
Hi Kiyan,

On Tue, Oct 2, 2012 at 11:32 PM, Kiyan Ahmadizadeh wrote:
> HBase allows clients to load data into HBase by generating HFiles in a
> MapReduce job and then loading those HFiles into HBase via running the
> CompleteBulkLoad tool.  We'd like to enable this behavior in Crunch.
> Getting crunch to generate HFiles as the result of the job is as simple as
> configuring the correct output format.  The question of where/when to
> invoke the CompleteBulkLoad tool on those generated files is a little
> trickier.  I originally posed this question to just Josh but on his
> suggestion I thought I'd open it up to the whole group.  Josh's original
> response is below and suggests adding a callback mechanism to Target.  This
> sounds like a good idea to me.  Does anyone else have some thoughts / ideas
> on the issue?

It's been quite a while since I worked with bulk imports in HBase, but
from what I remember (and from a look at the current HBase trunk), I
don't think it's necessarily as simple as just writing to
HFileOutputFormat to do a bulk load.

I think (and please correct me if I'm wrong) that additional
requirements for doing a bulk load (at least for an existing table)
are having all Puts (or KeyValues) sorted by total order, as well as
having all KeyValues partitioned according to existing regions, with
the partitioning being consistent over all column families. These
components can probably be largely plugged into a pipeline, but it's
more complex than just setting the output format to
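
As a rough illustration of the sorting and partitioning requirement
described above, here is a minimal plain-Java sketch. It does not use
the HBase or Crunch APIs: the class name RegionPartitionSketch, the
helper methods, and the region start keys are all made up for the
example. It assumes that row keys are ordered as unsigned bytes and
that the first region's start key is the empty key. Rows are sorted in
total order, and each row is then assigned to the region whose key
range contains it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RegionPartitionSketch {

    // Unsigned lexicographic comparison of two row keys.
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Returns the index of the last start key that is <= row, i.e. the
    // region whose [startKey, nextStartKey) range contains the row.
    // Assumes startKeys is sorted and startKeys[0] is the empty key.
    static int regionFor(byte[] row, byte[][] startKeys) {
        int lo = 0;
        int hi = startKeys.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (compareRows(startKeys[mid], row) <= 0) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical start keys of three existing regions.
        byte[][] startKeys = { bytes(""), bytes("g"), bytes("p") };

        String[] rows = { "zebra", "apple", "grape", "mango", "peach" };
        byte[][] sorted = new byte[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            sorted[i] = bytes(rows[i]);
        }

        // Total order sort, then assign each row to its region.
        Arrays.sort(sorted, RegionPartitionSketch::compareRows);
        for (byte[] row : sorted) {
            System.out.println(new String(row, StandardCharsets.UTF_8)
                    + " -> region " + regionFor(row, startKeys));
        }
    }
}
```

In a real pipeline the start keys would come from the target table
rather than being hard-coded, and the same partitioning would have to
be applied to every column family.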

Seeing as there is extra functionality (i.e. dedicated pipeline code)
needed in order to facilitate this, I'm wondering if adding callback
hooks to Target is worth it -- it might be easier to just add a call
to and then run the CompleteBulkLoad tool in the
dedicated pipeline code that sets up the sorted and partitioned

- Gabriel
