hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Custom Input Split
Date Wed, 22 Apr 2009 16:25:22 GMT
Oh, and the reason to use an MR job for counting rows is that if there are
many, a single process would take too long. (If you know you have a small
table, use the 'count' command in the shell.)

St.Ack

On Wed, Apr 22, 2009 at 9:06 AM, Stack <saint.ack@gmail.com> wrote:

> If you run
>
> ./bin/hadoop jar hbase.jar rowcounter
>
> It will emit usage.  You are a smart fellow. I think you can take it from
> there.
>
> Stack
>
>
>
>
> On Apr 22, 2009, at 5:48, Rakhi Khatwani <rakhi.khatwani@gmail.com> wrote:
>
>  Hi Lars,
>>          Thanks for the suggestion, I also figured out my problem using
>> TableInputFormatBase.
>>
>> My table had only one region, but I still wanted to split the input into
>> 4 maps, so I am basically overriding the getSplits() method in
>> TableInputFormatBase.
>>
>> One more question:
>> is there any method in the HBase API which can count the number of rows in
>> a table?
>> I tried googling it and all I came across is a RowCounter class, which is a
>> MapReduce job to count the number of rows, but I really don't know how to
>> use it. Any suggestions?
>>
>> thanks,
>> Raakhi
>>
>>
>> On Wed, Apr 22, 2009 at 4:30 AM, Lars George <lars@worldlingo.com> wrote:
>>
>>  Hi Rakhi,
>>>
>>> This is all done in the TableInputFormatBase class, which you can extend
>>> and then override the getSplits() function:
>>>
>>>
>>>
>>> http://hadoop.apache.org/hbase/docs/r0.19.1/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
>>>
>>> This is where you can then specify how many rows per map are assigned.
>>> Really straightforward as I see it. I have used it to implement a special
>>> "only use N regions" support where I can run a sample subset against an MR
>>> job. For example, only map 5 out of 8K regions of a table.
>>>
>>> The default one will always split all regions into N maps. Hence the
>>> recommendation to set the number of maps to the number of regions in a
>>> table. If you set it to something lower, then it will split the regions
>>> into a smaller number of maps but with more rows per map, i.e. each map
>>> gets more than one region to process.
>>>
>>> Look into the source of the above class and it should be obvious - I
>>> hope.
>>>
>>> Lars
>>>
>>>
>>>
>>> Rakhi Khatwani wrote:
>>>
>>>  Hi,
>>>>   I have a table with N records,
>>>>   now i want to run a map reduce job with 4 maps and 0 reduces.
>>>>   is there a way i can create my own custom input split so that i can
>>>> send 'n' records to each map??
>>>>  if there is a way, can i have a sample code snippet to gain better
>>>> understanding?
>>>>
>>>> Thanks
>>>> Raakhi.
>>>>
>>>>
>>>>
>>>>
>>>
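The default behaviour Lars describes (fold all regions into at most N maps, so that a lower N means more than one region per map) can be sketched roughly as below. This is a standalone illustration, not the HBase source: the class name RegionGrouping, the method groupRegions, and the use of String keys instead of byte[] are all assumptions made for readability. It also shows why a table with a single region yields a single split no matter how many maps are requested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: groups contiguous regions into splits.
class RegionGrouping {

    /**
     * Groups regions (given by their sorted start keys) into at most numSplits
     * contiguous key ranges. Each range is {startKey, endKey}; an empty endKey
     * stands for "to the end of the table" (an assumption of this sketch).
     */
    static List<String[]> groupRegions(String[] regionStartKeys, int numSplits) {
        List<String[]> splits = new ArrayList<>();
        // Never produce more splits than there are regions.
        int n = Math.min(numSplits, regionStartKeys.length);
        if (n <= 0) {
            return splits;
        }
        int base = regionStartKeys.length / n;   // regions every split gets
        int extra = regionStartKeys.length % n;  // first 'extra' splits get one more
        int start = 0;
        for (int i = 0; i < n; i++) {
            int end = start + base + (i < extra ? 1 : 0);
            // The range ends where the next group's first region begins.
            String endKey = end < regionStartKeys.length ? regionStartKeys[end] : "";
            splits.add(new String[] { regionStartKeys[start], endKey });
            start = end;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five regions folded into two maps, then one region with four maps requested.
        String[][] tables = { { "", "b", "d", "f", "h" }, { "" } };
        int[] requested = { 2, 4 };
        for (int t = 0; t < tables.length; t++) {
            for (String[] range : groupRegions(tables[t], requested[t])) {
                System.out.println("[" + range[0] + ", " + range[1] + ")");
            }
        }
    }
}
```

In an actual getSplits() override, each range would presumably be wrapped in the input split type used by the job rather than returned as a String pair.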

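For the case Rakhi raises, one region but four maps, a custom getSplits() has to invent split points inside the region's key range. A minimal sketch of that step follows, under a strong assumption that is not stated anywhere in the thread: row keys are fixed-width, zero-padded decimal numbers, so that numeric order and lexicographic order agree. The names SubRangeSplitter and splitRange are hypothetical, and real HBase row keys are byte[] rather than String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: cuts one key range into equal parts.
class SubRangeSplitter {

    /**
     * Splits the numeric key range [firstId, endId) into 'parts' contiguous
     * sub-ranges and renders the boundaries as zero-padded keys of the given
     * width. Each result is {startKey, endKey}, with endKey exclusive.
     */
    static List<String[]> splitRange(long firstId, long endId, int parts, int width) {
        if (parts <= 0 || endId <= firstId || firstId < 0) {
            throw new IllegalArgumentException("need parts > 0 and 0 <= firstId < endId");
        }
        long total = endId - firstId;
        String fmt = "%0" + width + "d";
        List<String[]> ranges = new ArrayList<>();
        long start = firstId;
        for (int i = 1; i <= parts; i++) {
            // Boundary i sits at i/parts of the way through the range
            // (assumes total * parts does not overflow a long).
            long end = firstId + (total * i) / parts;
            ranges.add(new String[] { String.format(fmt, start), String.format(fmt, end) });
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // One region holding keys 0000..0999, cut into four map inputs.
        for (String[] r : splitRange(0, 1000, 4, 4)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

If the keys are not evenly distributed numbers, the split points would have to come from somewhere else (for example, sampled keys), and the resulting maps may receive very different row counts.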