crunch-dev mailing list archives

From "Ryan Brush (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-246) HFileSource
Date Wed, 07 Aug 2013 14:24:48 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732031#comment-13732031 ]

Ryan Brush commented on CRUNCH-246:
-----------------------------------

I wrote an HFileInputFormat for a specific internal need some time ago, and I'd be happy
to contribute it to this (it's quite simple).  However, I think HBase intentionally doesn't
include an HFileInputFormat because it would be error-prone; some of the problems are discussed
at [1].  In short, the input wouldn't include data that is sitting in HBase's write-ahead log,
and is therefore logically written to HBase, but has not yet been flushed to an HFile. There
are ways to work around this if you really know what you're doing, but it might be worth
debating whether something like that makes sense in a public API. FWIW, the HBase community
seems to have decided this isn't a good thing to expose broadly.
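
To make the gap described above concrete, here is a toy model in plain Java. It is only a sketch and uses no HBase classes: ToyStore and all of its methods are hypothetical names invented for illustration, and the in-memory map merely stands in for edits that have been acknowledged (and logged) but not yet flushed. Real HBase behavior is more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of a single store. Not HBase code; every name here is hypothetical. */
public class ToyStore {
    // Edits accepted from clients but not yet written to a file
    // (in the real system these would also be recorded in the write-ahead log).
    private final TreeMap<String, String> unflushed = new TreeMap<>();
    // Immutable sorted files, standing in for HFiles.
    private final List<TreeMap<String, String>> flushedFiles = new ArrayList<>();

    public void put(String row, String value) {
        unflushed.put(row, value);
    }

    /** Writes the pending edits out as a new immutable file. */
    public void flush() {
        if (unflushed.isEmpty()) {
            return;
        }
        flushedFiles.add(new TreeMap<>(unflushed));
        unflushed.clear();
    }

    /** What a job that reads only the files would see. */
    public TreeMap<String, String> readFilesOnly() {
        TreeMap<String, String> view = new TreeMap<>();
        for (TreeMap<String, String> file : flushedFiles) {
            view.putAll(file); // later files overwrite earlier ones
        }
        return view;
    }

    /** What a read that goes through the store would see: files plus pending edits. */
    public TreeMap<String, String> readThroughStore() {
        TreeMap<String, String> view = readFilesOnly();
        view.putAll(unflushed);
        return view;
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.put("row1", "a");
        store.flush();
        store.put("row2", "b"); // accepted, but not flushed yet

        System.out.println(store.readThroughStore().keySet()); // [row1, row2]
        System.out.println(store.readFilesOnly().keySet());    // [row1]
    }
}
```

In this model, a file-only reader silently misses row2 until the next flush, which is the kind of error the paragraph above warns about.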

I recall some discussion of HBase snapshotting opening the door to running MapReduce jobs
directly against HFiles in a predictable way, but unfortunately I can't find a reference to
that. If and when that becomes available, it would make complete sense to expose it in Crunch
(and I'm sure some folks here at Cerner would be happy to contribute to that as well).

[1]
http://mail-archives.apache.org/mod_mbox/hbase-user/201202.mbox/%3C-684097822435171056@unknownmsgid%3E
                
> HFileSource
> -----------
>
>                 Key: CRUNCH-246
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-246
>             Project: Crunch
>          Issue Type: Improvement
>          Components: IO
>            Reporter: Chao Shi
>            Assignee: Chao Shi
>
> I found this useful for running MR directly on HFiles. I used it yesterday when copying
> a bunch of HFiles to another cluster (where the region layout is different).
> There is no HFileInputFormat provided by HBase, but I found the following via Google:
> https://gist.github.com/leifwickland/1120311
> http://blog.csdn.net/kirayuan/article/details/7794402 (Java version of the above. The
> webpage is in Chinese, but you can see the code)
> I'm not sure whether we can copy their code directly (copyright issue?). Does anyone know?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
