hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Reading in parallel from table's regions in MapReduce
Date Tue, 04 Sep 2012 16:41:28 GMT
I think the issue is that you are misinterpreting what you are seeing and what Doug was trying
to tell you...

The short, simple answer is that you're getting one split per region. Each split is assigned
to a specific mapper task, and that task sequentially walks through its region, finding the
rows that match your scan request.

There is no lock or blocking. 
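
To make that concrete, here is a toy model in plain Java. It is only a sketch and uses no HBase or Hadoop API: the class name, method name and region names are made up for illustration, and the thread pool merely stands in for the map slots the cluster offers (in Hadoop 1.x the per-tasktracker limit is the mapred.tasktracker.map.tasks.maximum property). Each "region" gets exactly one task, and how many tasks run at the same time depends only on the number of slots, not on any lock on the table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class SplitPerRegionSketch {

    /**
     * Submits one simulated map task per region to a pool with mapSlots threads
     * and returns the highest number of tasks that were running at the same time.
     */
    static int peakConcurrency(List<String> regions, int mapSlots) {
        ExecutorService slots = Executors.newFixedThreadPool(mapSlots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(regions.size());

        for (String region : regions) {
            // One split (and therefore one task) per region.
            slots.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(50); // stands in for scanning the rows of this region
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    done.countDown();
                }
            });
        }

        try {
            done.await(); // wait until every simulated map task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } finally {
            slots.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        List<String> regions = Arrays.asList("region-a", "region-b", "region-c", "region-d");
        System.out.println("1 slot,  peak concurrent tasks: " + peakConcurrency(regions, 1));
        System.out.println("4 slots, peak concurrent tasks: " + peakConcurrency(regions, 4));
    }
}
```

With a single slot the tasks can only run one after another, which looks just like what you describe in the jobtracker GUI even though nothing is locked.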

I think you really should read Lars George's book on HBase to get a better understanding.



On Sep 4, 2012, at 11:29 AM, Ioakim Perros <imperros@gmail.com> wrote:

> Thank you very much for your response and for the excellent reference.
> The thing is that I am running jobs in a distributed environment and, beyond the TableMapReduceUtil
> setup, I have just set the scan's caching to the number of rows I expect to retrieve in each
> map task, and the scan's block caching to false (just as indicated in the MapReduce
> examples on HBase's homepage).
> I am not aware of such a job configuration (requesting the jobtracker to execute more than
> 1 map task concurrently). Any other ideas?
> Thank you again and regards,
> ioakim
> On 09/04/2012 06:59 PM, Jerry Lam wrote:
>> Hi Ioakim:
>> Sorry, your hypothesis doesn't make sense. I would suggest you read
>> "Learning HBase Internals" by Lars Hofhansl at
>> http://www.slideshare.net/cloudera/3-learning-h-base-internals-lars-hofhansl-salesforce-final
>> to understand how HBase locking works.
>> Regarding the issue you are facing, are you sure you configured the job
>> properly (i.e. requesting the jobtracker to have more than 1 mapper
>> executing)? If you are testing on a single machine, you probably need to
>> configure the number of tasktrackers per node as well to see more than 1
>> mapper executing on a single machine.
>> my $0.02
>> Jerry
>> On Tue, Sep 4, 2012 at 11:17 AM, Ioakim Perros <imperros@gmail.com> wrote:
>>> Hello,
>>> I would be grateful if someone could shed some light on the following:
>>> Each M/R map task is reading data from a separate region of a table.
>>> From the jobtracker's GUI, in the map completion graph, I notice that
>>> although the data read by the mappers are different, they read data sequentially
>>> - as if the table had a lock that permits only one mapper to read data from
>>> any region at a time.
>>> Does this "lock" hypothesis make sense? Is there any way I could avoid
>>> this useless delay?
>>> Thanks in advance and regards,
>>> Ioakim
