hbase-issues mailing list archives

From "Andrew Mains (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13356) HBase should provide an InputFormat supporting multiple scans in mapreduce jobs over snapshots
Date Mon, 30 Mar 2015 05:56:53 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386231#comment-14386231 ]

Andrew Mains commented on HBASE-13356:
--------------------------------------

Spent some time speccing out a potential implementation for this today:

Interface: 

Jobs that need to run multiple scans over snapshots can use MultiTableSnapshotInputFormat. It
can be configured through TableMapReduceUtil, as usual, with the signature:
{code}
  /**
   * Sets up the job for reading from one or more table snapshots, with one or more
   * scans per snapshot.
   * It bypasses the HBase servers and reads directly from the snapshot files.
   *
   * @param snapshotScans map of snapshot name to the collection of scans on that snapshot.
   * @param mapper  The mapper class to use.
   * @param outputKeyClass  The class of the output key.
   * @param outputValueClass  The class of the output value.
   * @param job  The current job to adjust.  Make sure the passed job is
   *           carrying all necessary HBase configuration.
   * @param addDependencyJars upload HBase jars and jars for any of the configured
   *           job classes via the distributed cache (tmpjars).
   * @param tmpRestoreDir temporary directory into which the snapshots are restored.
   */
  public static void initMultiTableSnapshotMapperJob(Map<String, Collection<Scan>> snapshotScans,
                                                     Class<? extends TableMapper> mapper,
                                                     Class<?> outputKeyClass,
                                                     Class<?> outputValueClass, Job job,
                                                     boolean addDependencyJars, Path tmpRestoreDir
  ) throws IOException {
{code}

Implementation:

Most of the work can be done through delegation to TableSnapshotInputFormatImpl. The primary
change would be to make TableSnapshotInputFormatImpl.InputSplit take in a scan object and
restoreDir path, instead of retrieving these from the job configuration. This would allow
MultiTableSnapshotInputFormat to avoid setting an individual scan and restore directory on
the configuration (they can be passed along by way of the split, similar to TableSplit).
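
To make the split-carried state concrete, here is a minimal, self-contained sketch of the idea. It does not use the HBase API: ScanSpec, Split, and planSplits are hypothetical stand-ins invented for illustration, and the per-snapshot restore subdirectory is an assumption. A real implementation would also produce one split per region rather than one per (snapshot, scan) pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: every type here is a stand-in, not an HBase class. */
final class SnapshotSplitSketch {

    /** Hypothetical stand-in for a scan; the real Scan carries far more state. */
    static final class ScanSpec {
        final String startRow;
        final String stopRow;

        ScanSpec(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /** Hypothetical split that carries its own scan and restore directory. */
    static final class Split {
        final String snapshotName;
        final ScanSpec scan;
        final String restoreDir;

        Split(String snapshotName, ScanSpec scan, String restoreDir) {
            this.snapshotName = snapshotName;
            this.scan = scan;
            this.restoreDir = restoreDir;
        }
    }

    /** Expands the snapshot-to-scans map into one split per (snapshot, scan) pair. */
    static List<Split> planSplits(Map<String, Collection<ScanSpec>> snapshotScans,
                                  String tmpRestoreDir) {
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Collection<ScanSpec>> entry : snapshotScans.entrySet()) {
            // Assumption: each snapshot is restored into its own subdirectory.
            String restoreDir = tmpRestoreDir + "/" + entry.getKey();
            for (ScanSpec scan : entry.getValue()) {
                // The scan and restore directory travel with the split,
                // so nothing per-scan has to be written to the job configuration.
                splits.add(new Split(entry.getKey(), scan, restoreDir));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, Collection<ScanSpec>> snapshotScans = new LinkedHashMap<>();
        snapshotScans.put("snapshotA",
            Arrays.asList(new ScanSpec("a", "m"), new ScanSpec("m", "z")));
        snapshotScans.put("snapshotB", Arrays.asList(new ScanSpec("", "")));

        for (Split s : planSplits(snapshotScans, "/tmp/restore")) {
            System.out.println(s.snapshotName + " [" + s.scan.startRow + ", "
                + s.scan.stopRow + ") -> " + s.restoreDir);
        }
    }
}
```

Because each split is self-describing, a record reader would only need the split itself to know which snapshot to open, which rows to scan, and where the restored files live.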

Tests:

Any implementation should probably pass at least the tests for MultiTableInputFormat, and
possibly some of the tests for TableSnapshotInputFormat as well.

Thoughts?

> HBase should provide an InputFormat supporting multiple scans in mapreduce jobs over snapshots
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13356
>                 URL: https://issues.apache.org/jira/browse/HBASE-13356
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce
>            Reporter: Andrew Mains
>            Priority: Minor
>
> Currently, HBase supports the pushing of multiple scans to mapreduce jobs over live tables
> (via MultiTableInputFormat) but only supports a single scan for mapreduce jobs over table
> snapshots. It would be handy to support multiple scans over snapshots as well, probably through
> another input format (MultiTableSnapshotInputFormat?). To mimic the functionality present
> in MultiTableInputFormat, the new input format would likely have to take in the names of all
> snapshots used in addition to the scans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
