Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Fri, 25 Apr 2014 18:53:18 +0000 (UTC)
From: "Yu Li (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12707183.1396965256636.183129.1398451998177@arcas>
In-Reply-To: <JIRA.12707183.1396965256636@arcas>
References: <JIRA.12707183.1396965256636@arcas>
Subject: [jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper
 number set/control
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981441#comment-13981441 ] 

Yu Li commented on HBASE-10932:
-------------------------------

Hi [~jdcryans],

If we follow this logic, do you mean the "-m" option of DistCp is also useless?

IMHO, the job scheduler configuration in JT/YARN is server-side configuration, while the "-m" option is client-side configuration, and both are necessary.

Back to the scheduler discussion: I believe a job scheduler can only limit the maximum resources one user may use, and it is up to the user to decide how to use the resources assigned to him. Take the example you gave: what if the "slow" pool has 4 slots, only one user submits a rowcounter job, and he prefers to have only 2 maps running in parallel? I'm afraid asking the cluster operator to create another "slow" pool with only 2 slots is not a good solution.

In a common HBase ETL application, the user first runs distcp, then bulkload, then rowcounter to check data integrity. He would prefer distcp to run as fast as possible, but rowcounter to put only a light scan workload on the cluster. In this case, would he need to submit the distcp job to the "fast" queue and the rowcounter job to the "slow" queue? He would also need access to both queues...

Anyway, this is a real requirement from a user in our production environment, and I'm just trying to contribute it to the community in case it helps other users. But if you still think it is useless, just go ahead and close it; you're the boss after all. :-)

And no matter what decision is made, thanks for your time reviewing and discussing this JIRA.

> Improve RowCounter to allow mapper number set/control
> -----------------------------------------------------
>
>                 Key: HBASE-10932
>                 URL: https://issues.apache.org/jira/browse/HBASE-10932
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Minor
>         Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch
>
>
> The typical use case of RowCounter is some kind of data integrity check, e.g. after exporting data from an RDBMS to HBase, or from one HBase cluster to another, to make sure the row (record) counts match. Such a check usually has no strict response-time requirement.
> Meanwhile, in the current implementation, RowCounter launches one mapper per region, and each mapper sends one scan request. If the table is fairly big, say tens of regions, and the MR cluster has enough CPU cores to run all the mappers at once, the parallel scan requests sent by the mappers would be a real burden for the HBase cluster.
> So in this JIRA, we propose that rowcounter support an additional option "--maps" to specify the number of mappers, and that each mapper be able to scan more than one region of the target table.
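
To illustrate the idea of one mapper covering several regions, here is a minimal sketch in plain Java. It is not taken from the attached patches: the class name RegionGrouper, the method group, and the region names are all hypothetical, and it only shows one possible way to divide an ordered region list into at most N contiguous groups, one group per mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of HBase or the HBASE-10932 patch. */
final class RegionGrouper {

    /**
     * Splits an ordered list of region names into at most maxMaps contiguous
     * groups whose sizes differ by at most one. Each group stands for the
     * regions that a single mapper would scan one after another.
     */
    static List<List<String>> group(List<String> regions, int maxMaps) {
        if (maxMaps < 1) {
            throw new IllegalArgumentException("maxMaps must be >= 1");
        }
        // Never create more groups than there are regions.
        int groups = Math.min(maxMaps, regions.size());
        List<List<String>> result = new ArrayList<>();
        if (groups == 0) {
            return result;
        }
        int base = regions.size() / groups;   // minimum regions per mapper
        int extra = regions.size() % groups;  // the first 'extra' mappers take one more
        int start = 0;
        for (int i = 0; i < groups; i++) {
            int size = base + (i < extra ? 1 : 0);
            result.add(new ArrayList<>(regions.subList(start, start + size)));
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up region names; a real job would derive them from the table's regions.
        List<String> regions = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        // Prints [[r1, r2, r3], [r4, r5]]
        System.out.println(group(regions, 2));
    }
}
```

With 5 regions and a limit of 2 mappers, at most 2 scans hit the HBase cluster at any time instead of 5. Keeping each group contiguous means a mapper walks adjacent key ranges, which is only one possible design choice.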


--
This message was sent by Atlassian JIRA
(v6.2#6252)