Date: Thu, 18 Apr 2013 17:56:19 +0000 (UTC)
From: "Enis Soztutar (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-8369) MapReduce over snapshot files

    [ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635428#comment-13635428 ]

Enis Soztutar commented on HBASE-8369:
--------------------------------------

bq. in general I'm against having another way to direct access the data, since it means that you're giving up on optimizing the main one.

Conceptually, this is similar to the short circuit reads for HDFS. I agree that we should not need these kinds of optimizations, since in the long term, it will be impossible to implement QoS for IO if you give direct access to local files (for SSR) / hdfs files (for snapshot).

bq. if the final implementation will be like this one using the HRegion object, I'll be +1.
Yes, that is the plan.

bq. Are the initCredentials modifications in TableMapReduceUtil required for the scope of this patch?

Yes. We do not need to call initCredentials, since we are not talking to any HBase server.

> MapReduce over snapshot files
> -----------------------------
>
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: hbase-8369_v0.patch
>
>
> The idea is to add an InputFormat that can run a MapReduce job over snapshot files directly, bypassing the HBase server layer. The InputFormat is similar in usage to TableInputFormat, taking a Scan object from the user, but instead of running against an online table, it runs against a table snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader. A RegionScanner is used internally for doing the scan without any HRegionServer bits.
> Users have been asking and searching for ways to run MR jobs by reading directly from HFiles, so this allows new use cases if reading from stale data is ok:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to a remote HDFS cluster, and run the MR jobs on that cluster without an HBase cluster.
>  - (Future use case) Combine snapshot data with online HBase data: scan from yesterday's snapshot, but read today's data from the online HBase cluster.
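
To make the split and reader design in the description concrete, here is a minimal, self-contained sketch in plain Java. It does not use any HBase classes: `Region`, `RegionSplit`, `getSplits` and `scan` are hypothetical stand-ins invented for illustration, and the in-memory `TreeMap` only plays the role of the region's files. It shows one split per region, with each split read directly by its own reader rather than through a server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SnapshotScanSketch {

    // Hypothetical stand-in for one region's data as captured in a snapshot.
    static final class Region {
        final String name;
        final TreeMap<String, String> rows = new TreeMap<>();
        Region(String name) { this.name = name; }
    }

    // Hypothetical stand-in for an input split: exactly one region per split.
    static final class RegionSplit {
        final Region region;
        RegionSplit(Region region) { this.region = region; }
    }

    // One split per region in the snapshot, as the description states.
    static List<RegionSplit> getSplits(List<Region> snapshotRegions) {
        List<RegionSplit> splits = new ArrayList<>();
        for (Region r : snapshotRegions) {
            splits.add(new RegionSplit(r));
        }
        return splits;
    }

    // Plays the role of the RecordReader: reads the split's region directly and
    // returns rows in [startRow, stopRow). An empty stopRow means no upper bound
    // (an assumption of this sketch).
    static List<String> scan(RegionSplit split, String startRow, String stopRow) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : split.region.rows.tailMap(startRow, true).entrySet()) {
            if (!stopRow.isEmpty() && e.getKey().compareTo(stopRow) >= 0) {
                break;
            }
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Region r1 = new Region("region-1");
        r1.rows.put("a", "1");
        r1.rows.put("b", "2");
        Region r2 = new Region("region-2");
        r2.rows.put("m", "3");
        r2.rows.put("z", "4");

        // Each split is scanned independently, as a separate map task would do.
        for (RegionSplit s : getSplits(Arrays.asList(r1, r2))) {
            System.out.println(s.region.name + " -> " + scan(s, "b", "n"));
        }
    }
}
```

In the design described above, the per-split reader would open an HRegion and use a RegionScanner instead of walking an in-memory map; the sketch only mirrors the shape of that flow.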