hbase-issues mailing list archives

From "Yu Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control
Date Thu, 10 Apr 2014 03:05:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964924#comment-13964924 ]

Yu Li commented on HBASE-10932:

Hi [~jdcryans]
> What makes RowCounter so special that it's the only MR job that would benefit from this?
> I was pointing at the old TableInputFormatBase to show that it used to do this, and that the
> new one doesn't do it
Ok, got your point now. And yes, we could remove the special InputFormat for RowCounter and
_*fix*_ the new TableInputFormatBase. I created the special InputFormat for RowCounter only
because, from the comments on the new TableInputFormatBase's getSplits method, I thought it
was designed on purpose to make each mapper scan just one single region...

> I'm guessing because MR doesn't pass mapred.map.tasks as a hint anymore
In my understanding, it still passes mapred.map.tasks as a hint; the param is now contained
in the JobContext, so there's no need for a special int param for getSplits any more.
Regarding the parameter to pass the mapred.map.tasks hint, I'm referring to the distcp command,
which has a special "-m" param:
usage: distcp OPTIONS [source_path...] <target_path>
-m <arg>               Max number of concurrent maps to use for copy

> Well there's nothing preventing the JobTracker from filling up 4 machines and leave one quiet
Oh, there's some misunderstanding here. When talking about "real burden for the HBase cluster",
I didn't mean the CPU burden caused by the MR job but the IO burden caused by scan requests.
If we have 25 mappers there would be 25 scan requests, while with 20 mappers there would be
only 20 scan requests. This is especially useful in a multi-tenant environment, when we need
to check data integrity for one user after a data import but don't want the scan burden to
slow down the response time of other users' requests. Makes sense? :-)
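
To illustrate the 25-versus-20 example above, here is a minimal sketch in plain Java, with no HBase or Hadoop dependency, of how a mapper-count cap could assign several regions to one mapper. The class RegionGrouping and its group method are hypothetical names used only for illustration; they are not taken from the attached patches, and regions are represented by their index only.

```java
import java.util.ArrayList;
import java.util.List;

public class RegionGrouping {

    /**
     * Assigns regionCount regions (identified by index 0..regionCount-1) to at
     * most maxMappers contiguous groups of near-equal size. Each group stands
     * for the regions one mapper would scan, one after another.
     * Hypothetical helper, not part of HBase.
     */
    public static List<List<Integer>> group(int regionCount, int maxMappers) {
        if (regionCount < 0 || maxMappers < 1) {
            throw new IllegalArgumentException("regionCount >= 0 and maxMappers >= 1 required");
        }
        // Never create more groups than there are regions.
        int groups = Math.min(regionCount, maxMappers);
        List<List<Integer>> result = new ArrayList<>();
        if (groups == 0) {
            return result;
        }
        int base = regionCount / groups;   // minimum regions per mapper
        int extra = regionCount % groups;  // first 'extra' mappers take one more
        int next = 0;
        for (int g = 0; g < groups; g++) {
            int size = base + (g < extra ? 1 : 0);
            List<Integer> regions = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                regions.add(next++);
            }
            result.add(regions);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> plan = group(25, 20);
        // Prints: 25 regions -> 20 mappers
        System.out.println("25 regions -> " + plan.size() + " mappers");
    }
}
```

With 25 regions and a cap of 20, the first five groups hold two regions each and the remaining fifteen hold one, so at most 20 scans would be in flight at a time under this assumed scheme.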

> Improve RowCounter to allow mapper number set/control
> -----------------------------------------------------
>                 Key: HBASE-10932
>                 URL: https://issues.apache.org/jira/browse/HBASE-10932
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Minor
>         Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch
> The typical use case of RowCounter is to do some kind of data integrity checking, like
> after exporting some data from an RDBMS to HBase, or from one HBase cluster to another, making
> sure the row (record) count matches. Such a check usually isn't sensitive to response time.
> Meanwhile, based on the current impl, RowCounter will launch one mapper per region, and each
> mapper will send one scan request. Assuming the table is fairly big, say with tens of regions,
> and the MR cluster has enough CPU cores, the parallel scan requests sent by the mappers
> would be a real burden for the HBase cluster.
> So in this JIRA, we're proposing to make rowcounter support an additional option "--maps"
> to specify the mapper number, and make each mapper able to scan more than one region of the target
