hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-11482) Optimize HBase TableInputFormat and TableOutputFormat for tables and snapshots as Spark RDDs
Date Tue, 08 Jul 2014 21:12:04 GMT
Andrew Purtell created HBASE-11482:
--------------------------------------

             Summary: Optimize HBase TableInputFormat and TableOutputFormat for tables and
snapshots as Spark RDDs
                 Key: HBASE-11482
                 URL: https://issues.apache.org/jira/browse/HBASE-11482
             Project: HBase
          Issue Type: New Feature
            Reporter: Andrew Purtell


A core concept of Apache Spark is the resilient distributed dataset (RDD), a "fault-tolerant
collection of elements that can be operated on in parallel". One can create RDDs referencing
a dataset in any external storage system offering a Hadoop InputFormat, like HBase's TableInputFormat
and TableSnapshotInputFormat. 

Ensure the integration is reasonable and provides good performance.

Add the ability to save RDDs back to HBase with a {{saveAsHBaseTable}} action, implicitly
creating necessary schema on demand.

Add support for {{filter}} transformations that push predicates down to the server as HBase
filters. 
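
To illustrate why pushing predicates down matters, here is a minimal, self-contained Python
sketch. It does not use Spark or HBase; {{ToyRegion}}, {{is_error}}, and the {{f:level}} column
are hypothetical names invented for this example. It only compares how many rows leave the
"server" when the predicate runs on the client versus on the server.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Row = Tuple[str, Dict[str, str]]  # (row key, {column: value})


class ToyRegion:
    """Hypothetical stand-in for a server that holds rows sorted by key."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = sorted(rows)
        self.rows_shipped = 0  # rows sent back to the client

    def scan(
        self, server_filter: Callable[[Row], bool] = lambda row: True
    ) -> Iterator[Row]:
        # The filter is evaluated before a row is "shipped", which is the
        # essence of predicate pushdown.
        for row in self.rows:
            if server_filter(row):
                self.rows_shipped += 1
                yield row


def is_error(row: Row) -> bool:
    # Illustrative predicate on a made-up column.
    return row[1].get("f:level") == "ERROR"


rows = [
    (f"row{i:03d}", {"f:level": "ERROR" if i % 10 == 0 else "INFO"})
    for i in range(100)
]

# Client-side filtering: every row is shipped, then most are discarded.
client_side = ToyRegion(rows)
kept_on_client = [row for row in client_side.scan() if is_error(row)]

# Pushdown: the predicate runs on the server, so only matches are shipped.
pushed_down = ToyRegion(rows)
kept_on_server = list(pushed_down.scan(server_filter=is_error))
```

Both approaches return the same rows; the difference is the number of rows transferred,
which is the cost a pushed-down {{filter}} transformation would aim to reduce.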

Consider supporting conversions between Scala and Java types and HBase data using the HBase
types library.

Consider an option to lazily and automatically produce a snapshot only when needed, in a coordinated
way. (Concurrently executing workers may want to materialize a table snapshot RDD at the same
time.)
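
One possible shape for the coordination is a "get or create" step guarded by a lock, so that
concurrent requests for the same table trigger at most one snapshot. The Python sketch below
is only a single-process analogy: {{LazySnapshotRegistry}} and {{fake_take_snapshot}} are
hypothetical names, and a real implementation would need coordination across processes and
hosts rather than an in-process lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class LazySnapshotRegistry:
    """Creates a snapshot for a table on first request, then reuses it."""

    def __init__(self, take_snapshot: Callable[[str], str]) -> None:
        self._take_snapshot = take_snapshot
        self._lock = threading.Lock()
        self._snapshots: Dict[str, str] = {}

    def get_or_create(self, table: str) -> str:
        # Holding the lock while the snapshot is taken keeps the sketch
        # simple; it also serializes requests for unrelated tables.
        with self._lock:
            if table not in self._snapshots:
                self._snapshots[table] = self._take_snapshot(table)
            return self._snapshots[table]


calls: List[str] = []


def fake_take_snapshot(table: str) -> str:
    # Stand-in for the expensive snapshot operation; records each call.
    calls.append(table)
    return f"{table}-snapshot-{len(calls)}"


registry = LazySnapshotRegistry(fake_take_snapshot)

# Eight workers ask for a snapshot of the same table at the same time.
with ThreadPoolExecutor(max_workers=8) as pool:
    names = list(pool.map(registry.get_or_create, ["usertable"] * 8))
```

All eight workers receive the same snapshot name, and the snapshot function runs once. No
snapshot is taken for a table that nobody requests, which is the "only when needed" part.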



--
This message was sent by Atlassian JIRA
(v6.2#6252)
