hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8369) MapReduce over snapshot files
Date Fri, 13 Dec 2013 01:02:26 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13847004#comment-13847004 ]

Lars Hofhansl commented on HBASE-8369:

The only changes to existing HBase classes are exactly these hooks, though. Without them, this
cannot be done from outside code. Once those are in place anyway, we might as well add some new
classes for the M/R part; but it is fine to keep these outside, in which case they just become
part of the M/R job.

To explain my comment above:
Adding a few classes is not a fork, of course, but it starts a slippery slope. Once you have
started, it is easy to pile on top of that. And some HBase changes are needed, so it is an
actual patch that we need to maintain.
We have so far completely avoided that (except for some hopefully temporary security-related
changes to HDFS), and I have been a strong advocate for that in our organization. We have
also always forward-ported any changes we made to 0.96+. So it is frustrating to have to start
this even (or especially) for such a small change.

So please pardon my frustration.
I do not understand the reluctance here, as there is almost no risk and some folks will
be using 0.94 for a while.
Whether it is a new "feature" or not is not relevant (IMHO). HBase's slow M/R performance could
be considered a bug too, and then this would be a bug fix.

We're not breaking up over this :)

So it seems a good compromise would be to get the required hooks into HBase...?

> MapReduce over snapshot files
> -----------------------------
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0
>         Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch, HBASE-8369-0.94_v3.patch,
> HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch, HBASE-8369-trunk_v1.patch, HBASE-8369-trunk_v2.patch,
> HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch, hbase-8369_v11.patch, hbase-8369_v5.patch,
> hbase-8369_v6.patch, hbase-8369_v7.patch, hbase-8369_v8.patch, hbase-8369_v9.patch
>
> The idea is to add an InputFormat that can run a MapReduce job over snapshot files
> directly, bypassing the HBase server layer. The InputFormat is similar in usage to
> TableInputFormat, taking a Scan object from the user, but instead of running against an
> online table, it runs against a table snapshot. We do one split per region in the snapshot,
> and open an HRegion inside the RecordReader. A RegionScanner is used internally to do the
> scan without any HRegionServer bits.
> Users have been asking and searching for ways to run MR jobs by reading directly from
> HFiles, so this allows new use cases if reading stale data is OK:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to a remote HDFS cluster, and run the MR jobs on that cluster without HBase.
>  - (Future use case) Combine snapshot data with online HBase data: scan from yesterday's
>    snapshot, but read today's data from the online HBase cluster.
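
For readers unfamiliar with the "one split per region in the snapshot" idea in the issue description, here is a rough, self-contained sketch of the planning step. It is a toy model, not HBase code: the class and method names (SnapshotSplitSketch, Region, Split, splitsFor) are made up for illustration, row keys are plain Java strings rather than byte arrays, and clipping each split to the scan's row range is an assumption rather than something the description states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of planning one input split per snapshot region. Hypothetical names, not HBase API. */
public class SnapshotSplitSketch {

    /** A region's key range: startKey inclusive, endKey exclusive; "" means unbounded. */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Work for one mapper: a single region plus the row range to scan inside it. */
    static final class Split {
        final String regionName;
        final String startRow;
        final String stopRow;

        Split(String regionName, String startRow, String stopRow) {
            this.regionName = regionName;
            this.startRow = startRow;
            this.stopRow = stopRow;
        }

        @Override
        public String toString() {
            return regionName + " [" + startRow + ", " + stopRow + ")";
        }
    }

    /**
     * Returns one split for each region that overlaps the scan range
     * [scanStart, scanStop). An empty string means "no bound" on that side.
     */
    static List<Split> splitsFor(List<Region> regions, String scanStart, String scanStop) {
        List<Split> splits = new ArrayList<>();
        for (Region r : regions) {
            // Region ends at or before the first row the scan wants.
            boolean endsBeforeScan = !r.endKey.isEmpty() && r.endKey.compareTo(scanStart) <= 0;
            // Region starts at or after the scan's exclusive stop row.
            boolean startsAfterScan = !scanStop.isEmpty() && r.startKey.compareTo(scanStop) >= 0;
            if (endsBeforeScan || startsAfterScan) {
                continue;
            }
            // Clip the scan to the region. "" sorts lowest, so a plain max works for the start.
            String start = r.startKey.compareTo(scanStart) >= 0 ? r.startKey : scanStart;
            String stop;
            if (r.endKey.isEmpty()) {
                stop = scanStop;
            } else if (scanStop.isEmpty()) {
                stop = r.endKey;
            } else {
                stop = r.endKey.compareTo(scanStop) <= 0 ? r.endKey : scanStop;
            }
            splits.add(new Split(r.name, start, stop));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region("r1", "", "g"),
                new Region("r2", "g", "p"),
                new Region("r3", "p", ""));
        // A full scan touches every region; a narrow scan touches only the overlapping ones.
        System.out.println(splitsFor(regions, "", ""));
        System.out.println(splitsFor(regions, "h", "k"));
    }
}
```

In the design described above, each such split would then be handled by a RecordReader that opens the region's files directly and scans them without going through a region server; that part is not modeled here.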
