hbase-user mailing list archives

From Ioakim Perros <imper...@gmail.com>
Subject Re: Reading in parallel from table's regions in MapReduce
Date Tue, 04 Sep 2012 17:15:03 GMT
Jerry thank you very much for the links.

Regards,
Ioakim
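For readers of the archive: the configuration discussed in the thread below (scan caching, cacheBlocks, and TableMapReduceUtil) can be sketched roughly as follows. This is a minimal, untested sketch based on the HBase MapReduce examples linked in the thread; the table name, mapper class, and caching value are placeholders, not from the original messages.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ScanTableJob {

  // Hypothetical mapper; map() body omitted for brevity.
  static class MyMapper extends TableMapper<Text, Text> { }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-mytable"); // "mytable" is a placeholder name
    job.setJarByClass(ScanTableJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per RPC; tune to your row size
    scan.setCacheBlocks(false);  // recommended for full-table MapReduce scans

    // TableInputFormat creates one input split (and hence one map task)
    // per table region, as Michael explains below.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyMapper.class, Text.class, Text.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Whether the per-region map tasks actually run concurrently then depends on cluster scheduling, not on this job setup.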

On 09/04/2012 08:05 PM, Jerry Lam wrote:
> Hi Ioakim:
>
> Here is a list of links I would suggest you read (I know it is a lot to
> read):
> HBase Related:
> -
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
> -
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
> - make sure to read the examples:
> http://hbase.apache.org/book/mapreduce.example.html
>
> Hadoop Related:
> - http://wiki.apache.org/hadoop/JobTracker
> - http://wiki.apache.org/hadoop/TaskTracker
> - http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html
> - Some Configurations:
> http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
>
> HTH,
>
> Jerry
>
>
>> On Tue, Sep 4, 2012 at 12:41 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> I think the issue is that you are misinterpreting what you are seeing and
>> what Doug was trying to tell you...
>>
>> The short simple answer is that you're getting one split per region. Each
>> split is assigned to a specific mapper task and that task will sequentially
>> walk through the table finding the rows that match your scan request.
>>
>> There is no lock or blocking.
>>
>> I think you really should actually read Lars George's book on HBase to get
>> a better understanding.
>>
>> HTH
>>
>> -Mike
>>
>> On Sep 4, 2012, at 11:29 AM, Ioakim Perros <imperros@gmail.com> wrote:
>>
>>> Thank you very much for your response and for the excellent reference.
>>>
>>> The thing is that I am running jobs on a distributed environment and
>> beyond the TableMapReduceUtil settings,
>>> I have just set the scan's caching to the number of rows I expect to
>> retrieve at each map task, and the scan's cacheBlocks option to false
>> (just as indicated in the MapReduce examples on HBase's homepage).
>>> I am not aware of such a job configuration (requesting the jobtracker to
>> execute more than 1 map task concurrently). Any other ideas?
>>> Thank you again and regards,
>>> ioakim
>>>
>>>
>>> On 09/04/2012 06:59 PM, Jerry Lam wrote:
>>>> Hi Ioakim:
>>>>
>>>> Sorry, your hypothesis doesn't make sense. I would suggest you read
>> the
>>>> "Learning HBase Internals" by Lars Hofhansl at
>>>>
>> http://www.slideshare.net/cloudera/3-learning-h-base-internals-lars-hofhansl-salesforce-final
>>>> to
>>>> understand how HBase locking works.
>>>>
>>>> Regarding the issue you are facing, are you sure you configured the
>> job
>>>> properly (i.e. requested the jobtracker to run more than 1 mapper
>>>> concurrently)? If you are testing on a single machine, you probably
>>>> need to configure the number of map slots per tasktracker as well
>>>> before you will see more than 1 mapper execute on that machine.
>>>>
>>>> my $0.02
>>>>
>>>> Jerry
>>>>
>>>> On Tue, Sep 4, 2012 at 11:17 AM, Ioakim Perros <imperros@gmail.com>
>> wrote:
>>>>> Hello,
>>>>>
>>>>> I would be grateful if someone could shed a light to the following:
>>>>>
>>>>> Each M/R map task is reading data from a separate region of a table.
>>>>>  From the jobtracker's GUI, at the map completion graph, I notice that
>>>>> although the mappers read different data, they read it
>> sequentially
>>>>> - as if the table had a lock permitting only one mapper to read from
>> each
>>>>> region at a time.
>>>>>
>>>>> Does this "lock" hypothesis make sense? Is there any way I could avoid
>>>>> this useless delay?
>>>>>
>>>>> Thanks in advance and regards,
>>>>> Ioakim
>>>>>
>>>
>>
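For reference, the per-node mapper concurrency Jerry mentions is governed by a tasktracker setting. A hedged sketch of the relevant mapred-site.xml entry, using Hadoop 1.x property names (the value shown is an example, not from the thread):

```xml
<!-- mapred-site.xml: maximum map tasks one tasktracker runs concurrently -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- example value; size to your cores and memory -->
</property>
```

With the default of 2, a single-node test cluster would still run at most two of the per-region map tasks at once, which can look like the sequential behavior Ioakim describes.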

