Date: Tue, 8 Jul 2014 21:12:04 +0000 (UTC)
From: "Andrew Purtell (JIRA)"
To: dev@hbase.apache.org
Subject: [jira] [Created] (HBASE-11482) Optimize HBase TableInputFormat and TableOutputFormat for tables and snapshots as Spark RDDs

Andrew Purtell created HBASE-11482:
--------------------------------------

             Summary: Optimize HBase TableInputFormat and TableOutputFormat for tables and snapshots as Spark RDDs
                 Key: HBASE-11482
                 URL: https://issues.apache.org/jira/browse/HBASE-11482
             Project: HBase
          Issue Type: New Feature
            Reporter: Andrew Purtell

A core concept of Apache Spark is the resilient distributed dataset (RDD), a "fault-tolerant collection of elements that can be operated on in parallel". One can create RDDs referencing a dataset in any external storage system offering a Hadoop InputFormat, like HBase's TableInputFormat and TableSnapshotInputFormat.

Ensure the integration is reasonable and provides good performance.

Add the ability to save RDDs back to HBase with a {{saveAsHBaseTable}} action, implicitly creating the necessary schema on demand.

Add support for {{filter}} transformations that push predicates down to the server as HBase filters.

Consider supporting conversions between Scala and Java types and HBase data using the HBase types library.

Consider an option to lazily and automatically produce a snapshot only when needed, in a coordinated way. (Concurrently executing workers may want to materialize a table snapshot RDD at the same time.)
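
One way to picture the "lazy, coordinated snapshot" idea is a small registry that creates a snapshot the first time any caller asks for it and hands the same result to every concurrent caller. The sketch below is only an illustration under stated assumptions: the class name LazySnapshotRegistry, the method snapshotFor, and the snapshotCreator function are hypothetical and are not part of HBase or Spark. The creator function stands in for whatever call actually takes the snapshot. It also coordinates callers inside a single JVM only; coordinating workers across processes would need some external mechanism that this sketch does not attempt.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: creates at most one snapshot per table, on first request. */
final class LazySnapshotRegistry {
    // Table name -> future that holds the snapshot name once creation finishes.
    private final ConcurrentMap<String, CompletableFuture<String>> snapshots =
            new ConcurrentHashMap<>();
    // Stand-in for the real snapshot call: takes a table name, returns a snapshot name.
    private final Function<String, String> snapshotCreator;

    LazySnapshotRegistry(Function<String, String> snapshotCreator) {
        this.snapshotCreator = snapshotCreator;
    }

    /** Returns the snapshot name for a table, creating the snapshot only if none exists yet. */
    String snapshotFor(String table) {
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> existing = snapshots.putIfAbsent(table, mine);
        if (existing != null) {
            // Another caller already owns creation; wait for its result.
            return existing.join();
        }
        try {
            // This caller won the race, so it alone runs the creator.
            mine.complete(snapshotCreator.apply(table));
        } catch (RuntimeException e) {
            // Forget the failed attempt so that a later call can retry.
            snapshots.remove(table, mine);
            mine.completeExceptionally(e);
            throw e;
        }
        return mine.join();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();
        LazySnapshotRegistry registry = new LazySnapshotRegistry(table -> {
            created.incrementAndGet();
            return table + "-snapshot";
        });
        // Four workers ask for the same table's snapshot at roughly the same time.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> registry.snapshotFor("t1"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("snapshots created: " + created.get());
    }
}
```

Storing a future per table, rather than the snapshot name itself, lets late callers block on the in-flight creation instead of starting a second one. Removing the entry on failure is a design choice made here so that a transient error does not permanently poison the table's entry.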