hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-2001) Coprocessors: Colocate user code with regions
Date Mon, 08 Feb 2010 05:45:29 GMT

     [ https://issues.apache.org/jira/browse/HBASE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-2001:
----------------------------------

    Attachment: HBASE-2001.patch.gz

Latest patch contains simple working unit tests for basic Coprocessor hooks and also RegionObserver
interface hooks. 

The patch also includes the initial implementation of an in-process MapReduce framework. Coprocessors
can optionally implement a 'MapReduce' interface which clients will at some point be able to invoke
concurrently on all regions of the table within the HRegionServer (HRS) processes. (The server side
still needs unit tests and testing; there is no client side yet.) Note this is not MapReduce on the
table; this is MapReduce on each region, concurrently.

In-process MapReduce is multithreaded. Concurrency of mappers and reducers is specified separately.
Map jobs are submitted with a Scan object which defines the scope and any filters for a scanner
which feeds mappers. Mappers can emit intermediate KeyValues to a collector for reduction
or can get references to objects in the coprocessor's environment and perform operations on
them, e.g. increment an AtomicLong, etc. Reducers will get KeyValues from map phase output
ordered and grouped by key. Reducers also have access to objects in the coprocessor environment.
Therefore one can implement MapReduce in a manner very similar to Hadoop's MR framework, or
e.g. aggregating functions can use shared variables to avoid the overhead of generating (and
processing) a lot of intermediates.
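
To make the two styles above concrete, here is a minimal, self-contained sketch in plain Java. It is not code from the patch: the class and method names are hypothetical, strings stand in for KeyValues, and a fixed thread pool stands in for the framework's mapper concurrency. The first method emits intermediates that are then reduced in key order; the second skips intermediates and updates a shared AtomicLong.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; none of these names come from HBASE-2001.
final class RegionMapReduceSketch {

    // Runs every task on a fixed-size pool and waits for all of them to finish.
    private static void runAll(List<Runnable> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Runnable t : tasks) pool.execute(t);
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("mappers timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Style 1: mappers emit (key, 1) intermediates to a collector; the reduce
    // step then sees them ordered and grouped by key and sums each group.
    static Map<String, Long> countWithIntermediates(List<String> rows, int mapperThreads) {
        ConcurrentSkipListMap<String, List<Long>> collector = new ConcurrentSkipListMap<>();
        List<Runnable> mappers = new ArrayList<>();
        for (String row : rows) {
            mappers.add(() -> collector
                    .computeIfAbsent(row, k -> new CopyOnWriteArrayList<>())
                    .add(1L));
        }
        runAll(mappers, mapperThreads);

        Map<String, Long> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : collector.entrySet()) {
            long sum = 0;
            for (long v : group.getValue()) sum += v;
            reduced.put(group.getKey(), sum);
        }
        return reduced;
    }

    // Style 2: mappers emit nothing and instead increment a shared counter,
    // standing in for an object held in the coprocessor's environment.
    static long countWithSharedCounter(List<String> rows, int mapperThreads) {
        AtomicLong counter = new AtomicLong();
        List<Runnable> mappers = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            mappers.add(() -> counter.incrementAndGet());
        }
        runAll(mappers, mapperThreads);
        return counter.get();
    }
}
```

The second style produces only a total, but it never materializes a per-row intermediate, which is the overhead the aggregation case is trying to avoid.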

An in-process MapReduce job can be configured to auto commit. If so, KeyValues written to
the reduce collector by reducers will be automatically committed back to the region after
all reducers have completed execution. No values are committed to the region until all mappers
and reducers have completed successfully. Then we try really hard to commit them all.


KeyValues emitted by reducers must have a row key that falls within the bounds of the region
if the job is auto committing. Otherwise, the output can be arbitrary.
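
The bounds rule for auto-committing jobs could be checked roughly as follows. This is a sketch under assumptions, not the patch's code: the class and method names are hypothetical, and it assumes the usual HBase convention that a region's start key is inclusive, its end key is exclusive, and an empty key means "unbounded" on that side.

```java
// Hypothetical illustration only; not taken from the HBASE-2001 patch.
final class RegionBoundsSketch {

    // Lexicographic comparison of byte arrays, treating bytes as unsigned.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // True if row lies in [startKey, endKey); an empty key is treated as unbounded.
    static boolean inRegion(byte[] row, byte[] startKey, byte[] endKey) {
        boolean afterStart = startKey.length == 0 || compare(row, startKey) >= 0;
        boolean beforeEnd = endKey.length == 0 || compare(row, endKey) < 0;
        return afterStart && beforeEnd;
    }
}
```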

If a job is not auto committing, when it completes clients have access to the KeyValues output
by the reducer via a scanner-like interface.

The in-process MapReduce framework uses leases. A job is only alive as long as it has a lease.
Its output KeyValues are only available as long as it has a lease. So for long running jobs
the client must periodically poll status to keep it alive, and then retrieval by "scanner"
will also renew the lease. A lease cannot expire during auto commit. 
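
The lease behavior described above can be modeled in a few lines. This is a hypothetical sketch rather than the patch's lease code: the class name, the explicit `now` parameter (used so the example is deterministic), and the rule that an already expired lease cannot be renewed are all assumptions made for illustration.

```java
// Hypothetical illustration only; not taken from the HBASE-2001 patch.
final class JobLeaseSketch {
    private final long periodMillis;
    private long expiresAt;
    private boolean committing;

    JobLeaseSketch(long periodMillis, long now) {
        this.periodMillis = periodMillis;
        this.expiresAt = now + periodMillis;
    }

    // Called when the client polls status or reads output; returns false
    // if the lease had already expired and the job is no longer alive.
    synchronized boolean renew(long now) {
        if (isExpired(now)) return false;
        expiresAt = now + periodMillis;
        return true;
    }

    // A lease never counts as expired while an auto commit is in progress.
    synchronized boolean isExpired(long now) {
        return !committing && now >= expiresAt;
    }

    synchronized void beginAutoCommit() { committing = true; }

    synchronized void endAutoCommit() { committing = false; }
}
```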


> Coprocessors: Colocate user code with regions
> ---------------------------------------------
>
>                 Key: HBASE-2001
>                 URL: https://issues.apache.org/jira/browse/HBASE-2001
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>         Attachments: asm-3.2-bin.zip, asm-transformations.pdf, HBASE-2001.patch.gz
>
>
> Support user code that runs next to each region in a table. As regions split and move,
coprocessor code should automatically move also.
> Use classloader which looks on HDFS.
> Associate a list of classes to load with each table. Put this in HRI so it inherits from
table but can be changed on a per region basis (so then those region specific changes can
be inherited by daughters).
> Not completely arbitrary code, should require implementation of an interface with callbacks
for:
> * Open
> * Close
> * Split
> * Compact
> * (Multi)get and scanner next()
> * (Multi)put
> * (Multi)delete
> Add method to HRegionInterface for invoking coprocessor methods and retrieving results.
 
> Add methods in o.a.h.h.regionserver or subpackage which implement convenience functions
for coprocessor methods and consistent/controlled access to internals: store access, threading,
persistent and ephemeral state, scratch storage, etc. 
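
One possible shape for the callback interface listed in the description is sketched below. Every name here is hypothetical and the signatures are deliberately simplified (for example, rows are bare byte arrays); the actual interfaces are whatever the attached patch defines.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; the real interface is defined by the patch.
interface RegionCallbacksSketch {
    default void onOpen() {}
    default void onClose() {}
    default void onSplit() {}
    default void onCompact() {}
    default void onGet(byte[] row) {}
    default void onPut(byte[] row) {}
    default void onDelete(byte[] row) {}
}

// Example user code: counts puts seen by the region it is colocated with.
final class PutCounterSketch implements RegionCallbacksSketch {
    final AtomicLong puts = new AtomicLong();

    @Override
    public void onPut(byte[] row) {
        puts.incrementAndGet();
    }
}
```

Default no-op methods let user code override only the hooks it cares about, which keeps the "not completely arbitrary code" constraint while staying lightweight.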

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

