hbase-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1923) Bulk incremental load into an existing table
Date Thu, 20 May 2010 19:03:17 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869724#action_12869724
] 

Todd Lipcon commented on HBASE-1923:
------------------------------------

h1. Basic Design:

h2. Changes to HFileOutputFormat:

Should only need changes during job initialization:

# scan all regions of table from .META.
# configure TotalOrderPartitioner based on existing region key boundaries
# If the number of reducers > number of regions, we could
  (a) recursively split table until this is not true (degenerate case: incremental load into
table with one row?)
  (b) simply split keyspace by taking the lexical "halfway" of the region (two HFiles go into
one region in load stage)
  (c) add API to regionserver to get estimate of region midpoint (assuming that new data has
similar distribution to old data)

I plan to do either (a) or (b) initially.
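
As a rough illustration of steps 2 and 3(b) above, here is a minimal Python sketch. It is not HBase code: the function names are hypothetical, regions are modelled as plain (start key, end key) byte-string pairs, and an empty end key is assumed to mean "no upper bound". The midpoint is computed by treating keys as big-endian numbers padded to a common width, which is only one possible reading of the lexical "halfway".

```python
def partition_split_points(region_start_keys):
    """Split points for a total-order partitioner: every region start key
    except the empty start key of the first region."""
    return sorted(k for k in region_start_keys if k != b"")


def lexical_midpoint(start, end):
    """Return a key strictly between start and end in byte order.

    Assumption: an empty end key means the region has no upper bound.
    """
    width = max(len(start), len(end)) + 1  # one extra byte of precision
    lo = int.from_bytes(start.ljust(width, b"\x00"), "big")
    if end:
        hi = int.from_bytes(end.ljust(width, b"\x00"), "big")
    else:
        hi = 256 ** width  # just past the largest key of this width
    if lo >= hi:
        raise ValueError("start key must sort before end key")
    return ((lo + hi) // 2).to_bytes(width, "big")


def halve_regions(regions):
    """Option (b): region boundaries plus one midpoint inside each region."""
    points = set()
    for start, end in regions:
        if start:
            points.add(start)
        points.add(lexical_midpoint(start, end))
    return sorted(points)
```

With two regions split at "m", `halve_regions([(b"", b"m"), (b"m", b"")])` yields three split points, so four reducers would each write one HFile and two HFiles would land in each region at load time.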

We should provide at least some sample code, if not good utility classes/methods to do this
task.

h3. Job Running

Should be unaffected

h3. Data Loader

Note that the partitions output by the MR job no longer necessarily correspond to the region
boundaries (regions could have split or merged). I think the algorithm looks like:

{code}
for each reducer output:
  inspect hfile to find lowest key and highest key
  look up region name/startkey/endkey corresponding to first key in hfile
  if HFile's low<->high is entirely contained within the region's low<->high:
    send RPC to RS: loadIncremental(region name, "/path/to/hfile")
  else:
    # this is the inefficient path, if the region split during the MR job
    On the loading side, manually split the HFile into two physical HFiles
      in a tmp directory
    recurse on the split files
{code}

The "inefficient" path should occur in a minority of cases. In the future we can implement
this path using reference files that would be cleaned at next compaction. I don't plan to
do this in the first pass.
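
The loader loop above can be sketched in Python against an in-memory model. Everything here is an assumption made for illustration: an HFile is stood in for by a sorted list of keys, regions by sorted (start key, end key) pairs with an empty end key meaning unbounded, and the split point is taken to be the end key of the region that holds the file's first key. The function returns the (region, keys) pairs that would each become one loadIncremental call.

```python
import bisect


def find_region(regions, key):
    """Return the (start, end) region containing key.

    regions must be sorted by start key, and the first start key is b"".
    """
    starts = [start for start, _ in regions]
    return regions[bisect.bisect_right(starts, key) - 1]


def plan_loads(regions, hfile_keys):
    """Map one reducer output onto regions, splitting it where needed."""
    if not hfile_keys:
        return []
    low, high = hfile_keys[0], hfile_keys[-1]
    region = find_region(regions, low)
    _, end = region
    if end == b"" or high < end:
        # Fast path: the whole file fits inside a single region.
        return [(region, hfile_keys)]
    # Inefficient path: cut the file at the region boundary and recurse.
    cut = bisect.bisect_left(hfile_keys, end)
    return (plan_loads(regions, hfile_keys[:cut])
            + plan_loads(regions, hfile_keys[cut:]))
```

Each recursive call works on a strictly shorter key list, so the recursion ends even if a file spans many regions.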


The above functionality would be implemented in a client-side script/program (currently a
Ruby script, though I will probably just write it in Java).

h3. RegionServer Side

Need to implement the "loadIncremental" RPC. This function needs to do the following reasonably
simple steps:
# ensure that the region being accessed is the same one being hosted (including timestamp,
etc)
# move the HFile into the correct store directory
# briefly lock the storefile list and add the HFile

This probably also needs some coordination with concurrent scanners, etc. I will look at this
carefully during implementation.
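
The three steps can be sketched as follows. This is a toy model rather than the planned RegionServer code: the `Store` class and its fields are hypothetical, the local filesystem stands in for HDFS, the region check compares only a name string, and scanner coordination is left out entirely.

```python
import os
import threading


class Store:
    """Toy stand-in for one store of a hosted region (hypothetical)."""

    def __init__(self, region_name, store_dir):
        self.region_name = region_name
        self.store_dir = store_dir
        self.store_files = []
        self._lock = threading.Lock()

    def load_incremental(self, region_name, hfile_path):
        # Step 1: refuse requests addressed to a region we no longer host.
        if region_name != self.region_name:
            raise ValueError("region %r is not hosted here" % region_name)
        # Step 2: move the file into the store directory.
        dest = os.path.join(self.store_dir, os.path.basename(hfile_path))
        os.replace(hfile_path, dest)
        # Step 3: hold the lock only long enough to publish the new file.
        with self._lock:
            self.store_files.append(dest)
        return dest
```

Doing the move before taking the lock keeps the critical section down to a single list append, which matches the "briefly lock" intent of step 3.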


> Bulk incremental load into an existing table
> --------------------------------------------
>
>                 Key: HBASE-1923
>                 URL: https://issues.apache.org/jira/browse/HBASE-1923
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: client, mapred, regionserver, scripts
>    Affects Versions: 0.21.0
>            Reporter: anty.rao
>            Assignee: Todd Lipcon
>
> HBASE-48 is about bulk load of a new table; maybe it's more practicable to bulk load against
> an existing table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

