incubator-crunch-dev mailing list archives

From Kiyan Ahmadizadeh <ki...@wibidata.com>
Subject Loading HFiles via Crunch / Job output post processing steps
Date Tue, 02 Oct 2012 21:32:03 GMT
Hi Everyone,

HBase allows clients to bulk load data by generating HFiles in a
MapReduce job and then loading those HFiles into HBase by running the
CompleteBulkLoad tool.  We'd like to enable this behavior in Crunch.

Getting Crunch to generate HFiles as the output of a job is as simple as
configuring the correct output format.  The question of where and when to
invoke the CompleteBulkLoad tool on those generated files is a little
trickier.  I originally posed this question to just Josh, but on his
suggestion I thought I'd open it up to the whole group.  Josh's original
response is below and suggests adding a callback mechanism to Target.  This
sounds like a good idea to me.  Does anyone else have thoughts or ideas
on the issue?

Thanks!

-Kiyan

From Josh:
Are you asking from the Crunch perspective, or the HBase perspective?
HBase has the CompleteBulkLoad tool, so I'm assuming you're asking
about the right way to wire it up into Crunch?

http://hbase.apache.org/book/arch.bulk.load.html

It seems like we would want a callback on Targets that would notify
them that the output they were interested in had been generated and
that they should do whatever subsequent processing on it they would
want, right? That could either be a hook on Target itself, or some
sub-interface of Target that we might check for at the end of a job--
but it seems like putting it on Target itself is the right approach.
Is that what you guys were contemplating?
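
To make the callback idea above concrete, here is a minimal sketch of what
a hook on Target might look like. Everything in it is an assumption made
for illustration: the Target interface shown is a simplified stand-in and
not Crunch's real interface, and onOutputComplete, HFileTarget, TextTarget
and finishJob are hypothetical names. The bulk load itself is only recorded
in a log rather than performed.

```java
import java.util.ArrayList;
import java.util.List;

final class TargetCallbackSketch {

    /** Simplified stand-in for a Crunch Target (hypothetical, not the real interface). */
    interface Target {
        String getPath();

        /** Hypothetical hook: called once the job has finished writing this target's output. */
        default void onOutputComplete() {
            // Most targets need no post-processing, so the default does nothing.
        }
    }

    /** Hypothetical target whose output is a directory of generated HFiles. */
    static final class HFileTarget implements Target {
        private final String path;
        private final List<String> log;

        HFileTarget(String path, List<String> log) {
            this.path = path;
            this.log = log;
        }

        @Override
        public String getPath() {
            return path;
        }

        @Override
        public void onOutputComplete() {
            // A real implementation would run HBase's bulk load step on `path` here.
            // This sketch only records that the hook fired.
            log.add("bulk load " + path);
        }
    }

    /** Hypothetical target that relies on the default no-op hook. */
    static final class TextTarget implements Target {
        private final String path;

        TextTarget(String path) {
            this.path = path;
        }

        @Override
        public String getPath() {
            return path;
        }
    }

    /** Hypothetical end-of-job step: notify every target that its output is ready. */
    static void finishJob(List<Target> targets) {
        for (Target target : targets) {
            target.onOutputComplete();
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Target> targets = new ArrayList<>();
        targets.add(new TextTarget("/tmp/text-out"));
        targets.add(new HFileTarget("/tmp/hfiles", log));
        finishJob(targets);
        System.out.println(log);
    }
}
```

Putting the hook on Target itself with a no-op default means the end-of-job
code does not need to check for a sub-interface; targets that need no
post-processing simply inherit the default.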

Also, feel free to put this on crunch-dev, I'm sure other folks will
be interested even if they don't have a lot to contribute to the
implementation discussion.
