hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-3936) Incremental bulk load support for Increments
Date Mon, 30 May 2011 16:02:47 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-3936:
----------------------------------

    Description: 
From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a MapReduce
job to output table data in HBase's internal data format, and then directly loads the data
files into a running cluster. Using bulk load will use less CPU and network than going via
the HBase API."

I have been working with a specific implementation of, and can envision, a class of applications
that reduce data into a large collection of counters, perhaps building projections of the
data in many dimensions in the process. One can use Hadoop MapReduce as the engine for this
over a given data set, then use LoadIncrementalHFiles to move the result into place for live
serving. MR is a natural fit for summation over very large counter sets: emit counter increments
for the data set and its projections in mappers, use combiners for partial aggregation, and
use reducers to do the final summation into HFiles.
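
As a concrete illustration, here is a minimal sketch of such a job, assuming a single column
family "f" and qualifier "count". The class names and column layout are illustrative only,
not part of HBase; each input line is simply taken to be a counter key:

{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CounterBulkLoad {

  static final byte[] FAMILY = Bytes.toBytes("f");        // illustrative
  static final byte[] QUALIFIER = Bytes.toBytes("count"); // illustrative

  // Mapper: emit an increment of 1 for each occurrence of a counter key.
  public static class IncrementMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new ImmutableBytesWritable(Bytes.toBytes(line.toString())), ONE);
    }
  }

  // Combiner: partial aggregation of increments for a key.
  public static class SumCombiner extends Reducer<ImmutableBytesWritable,
      LongWritable, ImmutableBytesWritable, LongWritable> {
    protected void reduce(ImmutableBytesWritable key, Iterable<LongWritable> values,
        Context ctx) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      ctx.write(key, new LongWritable(sum));
    }
  }

  // Reducer: final summation, emitted as a KeyValue (an 8-byte serialized
  // long, as HBase counters expect) for HFileOutputFormat.
  public static class HFileReducer extends Reducer<ImmutableBytesWritable,
      LongWritable, ImmutableBytesWritable, KeyValue> {
    protected void reduce(ImmutableBytesWritable key, Iterable<LongWritable> values,
        Context ctx) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      ctx.write(key, new KeyValue(key.get(), FAMILY, QUALIFIER, Bytes.toBytes(sum)));
    }
  }
}
{code}

The driver would wire these up with HFileOutputFormat.configureIncrementalLoad() (replacing
its default reducer with HFileReducer above) and then move the output into place as today
with LoadIncrementalHFiles.doBulkLoad().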

However, it is not then possible to merge a set of updates into an existing table built in
the manner above without either 1) joining the table data and the update set into a large
temporary MR data set, followed by a complete rewrite of the table; or 2) posting all of the
updates as Increments via the HBase API, which impacts any other concurrent users of the HBase
service and may take 10-100 times longer than if the updates could be computed directly into
HFiles like the original import. Both alternatives are expensive in terms of CPU and time;
the first is also expensive in terms of disk.
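
For reference, alternative 2 amounts to one round trip per updated row through the online
write path (table and column names illustrative):

{code:java}
HTable table = new HTable(conf, "counters");            // illustrative table
Increment inc = new Increment(Bytes.toBytes("row-1"));  // one per updated row
inc.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), 5L);
table.increment(inc);  // one RPC, contending with live traffic
{code}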

I propose adding incremental bulk load support for Increments. Here is a sketch of a possible
implementation:

* Add a KV type for Increment

* Modify the HFile tool (its main()), LoadIncrementalHFiles, and other code that works with
HFiles directly to handle the new KV type

* The bulk load API can move the files to be merged into the Stores as before.

* Implement an alternate compaction algorithm, or modify the existing one. Compaction needs to
identify Increments and apply them to the existing most recent version of a value, or create
the value if it does not exist, using the same folding rule sketched after this list.
  ** Use KeyValueHeap as-is to merge value-sets by row, as before.
  ** For each row, use a KV-keyed Map for in-memory update of values.
  ** If there is an existing value and it is not a serialized long, ignore the Increment and
log at INFO level.
  ** Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with an appropriate
memory limit, so that work for overlarge rows spills to disk. This can be local disk, not HDFS.

* Never return an Increment KV to a client doing a Get or Scan.
  ** Before the merge is complete, if we find an Increment KV when searching Store files for
a value, continue searching back through the Store files, summing Increments as they are
encountered, until we find a Put KV for the value, and then apply the sum to the Put value;
or, if the search ends without finding one, treat the accumulated Increment as a Put. (A
sketch of this folding appears after this list.)
  ** If there is an existing value and it is not a serialized long, ignore the Increment and
log at INFO level.

* As a beneficial side effect, with Increments as just another KV type we can unify Put and
Increment handling.
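
To make the folding rule referenced above concrete, here is a minimal, self-contained sketch.
KeyValue has no Increment type today, so a stand-in Cell class models the proposed type; every
name below is hypothetical:

{code:java}
import java.nio.ByteBuffer;
import java.util.List;

public class IncrementFoldSketch {

  // Stand-in for the proposed new KV type alongside the existing Put type.
  enum Type { PUT, INCREMENT }

  static class Cell {
    final Type type;
    final byte[] value;  // Increment cells always carry a serialized long
    Cell(Type type, byte[] value) { this.type = type; this.value = value; }
  }

  /** Folds a value's versions, ordered newest-first as a Store search
   *  would visit them, into the single value a client should see. */
  static byte[] fold(List<Cell> versions) {
    long pending = 0;  // increment amounts accumulated so far
    for (Cell c : versions) {
      if (c.type == Type.INCREMENT) {
        pending += ByteBuffer.wrap(c.value).getLong();
      } else if (c.value.length != 8) {
        // Existing value is not a serialized long: ignore the pending
        // Increments (this is where we would log at INFO level).
        return c.value;
      } else {
        // Found the most recent Put: apply the accumulated sum to it.
        long base = ByteBuffer.wrap(c.value).getLong();
        return ByteBuffer.allocate(8).putLong(base + pending).array();
      }
    }
    // Search ended without finding a Put: the sum is treated as a Put.
    return ByteBuffer.allocate(8).putLong(pending).array();
  }
}
{code}

The same routine would serve both paths: applied lazily per lookup on the read side before
the merge completes, and applied permanently when compaction rewrites the Store files.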

Because this is a core concern, I'd prefer to discuss this as a possible enhancement to core
rather than as a Coprocessor-based extension. However, it could be possible to implement all
but the KV changes within the Coprocessor framework.


  was:
From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a MapReduce
job to output table data in HBase's internal data format, and then directly loads the data
files into a running cluster. Using bulk load will use less CPU and network than going via
the HBase API."

I have been working with a specific implementation of, and can envision, a class of applications
that reduce data into a large collection of counters, perhaps building projections of the
data in many dimensions in the process. One can use Hadoop MapReduce as the engine for this
over a given data set, then use LoadIncrementalHFiles to move the result into place for live
serving. MR is a natural fit for summation over very large counter sets: emit counter increments
for the data set and its projections in mappers, use combiners for partial aggregation, and
use reducers to do the final summation into HFiles.

However, it is not then possible to merge a set of updates into an existing table built in
the manner above without either 1) joining the table data and the update set into a large
temporary MR data set, followed by a complete rewrite of the table; or 2) posting all of the
updates as Increments via the HBase API, which impacts any other concurrent users of the HBase
service and may take 10-100 times longer than if the updates could be computed directly into
HFiles like the original import. Both alternatives are expensive in terms of CPU and time;
the first is also expensive in terms of disk.

I propose adding incremental bulk load support for Increments. Here is a sketch of a possible
implementation:

* Add a KV type for Increment

* Modify the HFile tool (its main()), LoadIncrementalHFiles, and other code that works with
HFiles directly to handle the new KV type

* The bulk load API can move the files to be merged into the Stores as before.

* Implement an alternate compaction algorithm, or modify the existing one. Compaction needs to
identify Increments and apply them to the existing most recent version of a value, or create
the value if it does not exist.
  ** Use KeyValueHeap as-is to merge value-sets by row, as before.
  ** For each row, use a KV-keyed Map for in-memory update of values.
  ** Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with an appropriate
memory limit, so that work for overlarge rows spills to disk. This can be local disk, not HDFS.

* Never return an Increment KV to a client doing a Get or Scan.
  ** Before the merge is complete, if we find an Increment KV when searching Store files for
a value, continue searching back through the Store files, summing Increments as they are
encountered, until we find a Put KV for the value; or, if the search ends without finding
one, treat the accumulated Increment as a Put.

* As a beneficial side effect, with Increments as just another KV type we can unify Put and
Increment handling.

Because this is a core concern, I'd prefer to discuss this as a possible enhancement to core
rather than as a Coprocessor-based extension. However, it could be possible to implement all
but the KV changes within the Coprocessor framework.



> Incremental bulk load support for Increments
> --------------------------------------------
>
>                 Key: HBASE-3936
>                 URL: https://issues.apache.org/jira/browse/HBASE-3936
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>             Fix For: 0.94.0
>
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
