hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control
Date Fri, 25 Apr 2014 17:25:15 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981271#comment-13981271 ]

Jean-Daniel Cryans commented on HBASE-10932:

bq. But I didn't quite catch the point of job scheduler, in my understanding job scheduler
is cluster-level and cannot be configured per-job, right? 

Well, by using a scheduler, you can constrain certain types of jobs so that they don't run
as fast as they can. For example, with the fair scheduler you can configure a pool (let's
call it the "slow pool") so that at most {{maxMaps}} map tasks run concurrently on the cluster.
Then, when you run your {{RowCounter}} jobs and whatnot, you can tie them automatically to the
slow pool. Hadoop cluster operators usually know how to use a scheduler, whereas relying on
the person who runs the jobs to configure them correctly invites human errors like
"oops I forgot to pass the maps configuration to my row counter and now the website is down".

It also works well if you have two users who want to run a row counter concurrently. With
{{maxMaps}} set to two, they'll both land in the slow pool and only two mappers will run in
total (alternating between the two jobs, unless you set different weights because one user is
more important than the other, and so on). If you were to rely on individual users specifying
the correct number of maps, and they both set their job to use two, then you'd have four
mappers running. Back to square one.
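
To make the arithmetic concrete, here is a small toy model in Java. It is only a sketch and not Hadoop scheduler code: the class {{SlowPoolDemo}} and its methods are hypothetical names made up for illustration, the pool is assumed to hand out slots round-robin with equal weights, and each job is assumed to have ten pending map tasks.

```java
public class SlowPoolDemo {

    // Each job enforces its own cap, so the caps of concurrent jobs add up.
    static int peakWithPerJobCaps(int[] perJobCaps, int[] pendingMaps) {
        int total = 0;
        for (int i = 0; i < perJobCaps.length; i++) {
            total += Math.min(perJobCaps[i], pendingMaps[i]);
        }
        return total;
    }

    // One cap for the whole pool: slots are handed out one at a time,
    // round-robin across jobs (equal weights assumed), until the cap is
    // reached or no job has pending maps left.
    static int[] shareSlots(int poolMaxMaps, int[] pendingMaps) {
        int[] slots = new int[pendingMaps.length];
        int remaining = poolMaxMaps;
        boolean granted = true;
        while (remaining > 0 && granted) {
            granted = false;
            for (int i = 0; i < pendingMaps.length && remaining > 0; i++) {
                if (slots[i] < pendingMaps[i]) {
                    slots[i]++;
                    remaining--;
                    granted = true;
                }
            }
        }
        return slots;
    }

    // Total concurrent mappers when all jobs share one pool cap.
    static int peakWithSharedPool(int poolMaxMaps, int[] pendingMaps) {
        int total = 0;
        for (int s : shareSlots(poolMaxMaps, pendingMaps)) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] pending = {10, 10}; // two row counter jobs, ten pending maps each
        System.out.println("per-job caps of 2: "
                + peakWithPerJobCaps(new int[] {2, 2}, pending) + " mappers");
        System.out.println("shared pool cap of 2: "
                + peakWithSharedPool(2, pending) + " mappers");
    }
}
```

Run as is, this prints 4 mappers for the per-job caps and 2 mappers for the shared pool cap, with the shared pool giving each of the two jobs one slot.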

Anyways, all of this to say that there's a more generic way of doing this, and it already
exists. Can we close this jira, [~carp84]?

> Improve RowCounter to allow mapper number set/control
> -----------------------------------------------------
>                 Key: HBASE-10932
>                 URL: https://issues.apache.org/jira/browse/HBASE-10932
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Minor
>         Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch
> The typical use case of RowCounter is some kind of data integrity check, such as making
sure the row (record) count matches after exporting data from an RDBMS to HBase, or from one
HBase cluster to another. Such a check usually has no strict response-time requirement.
> Meanwhile, with the current implementation, RowCounter launches one mapper per region, and
each mapper sends one scan request. If the table is fairly big, say tens of regions, and the
MR cluster has enough CPU cores to run all the mappers at once, the parallel scan requests
sent by the mappers can be a real burden for the HBase cluster.
> So in this JIRA, we're proposing to make rowcounter support an additional option "--maps"
to specify mapper number, and make each mapper able to scan more than one region of the target
