hbase-user mailing list archives

From bharath v <bharathvissapragada1...@gmail.com>
Subject Re: MR in HBase
Date Mon, 11 Jan 2010 16:45:21 GMT
John,

I implemented this some time ago. My need was similar to yours: scanning
more than one table in the map phase at the same time, as in the example you
mentioned. You just need to follow the steps Mridul described.

You need to change the getSplits() and getRecordReader() functions so that
they can handle two tables at the same time:

if (split belongs to table1)
{
    return RR/InputSplits for t1;
}
else
{
    return RR/InputSplits for t2;
}
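
To make the dispatch concrete, here is a minimal sketch in plain Java. It is
only a model of the idea: Split, Format, SingleTable and Composite are
hypothetical stand-ins that I made up for illustration, not the Hadoop or
HBase InputSplit / InputFormat / RecordReader classes, and the "record
reader" is just a descriptive string. The assumption is that every split
remembers which table it came from, so the composite can route it back to
the right delegate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a composite input format. All types here are
// hypothetical stand-ins, NOT real Hadoop/HBase classes.
final class CompositeSketch {

    /** A split that remembers which table it was cut from. */
    static final class Split {
        final String table;
        final String startRow;

        Split(String table, String startRow) {
            this.table = table;
            this.startRow = startRow;
        }
    }

    /** Minimal stand-in for an input format. */
    interface Format {
        List<Split> getSplits();

        // A real format would return a RecordReader; a string is enough here.
        String createRecordReader(Split split);
    }

    /** Input format over one table: one split per region start row. */
    static final class SingleTable implements Format {
        private final String table;
        private final String[] regionStartRows;

        SingleTable(String table, String... regionStartRows) {
            this.table = table;
            this.regionStartRows = regionStartRows;
        }

        public List<Split> getSplits() {
            List<Split> splits = new ArrayList<>();
            for (String start : regionStartRows) {
                splits.add(new Split(table, start));
            }
            return splits;
        }

        public String createRecordReader(Split split) {
            return "reader(" + table + "," + split.startRow + ")";
        }
    }

    /** Composite: concatenates delegate splits, routes each split back. */
    static final class Composite implements Format {
        private final Map<String, Format> delegates = new LinkedHashMap<>();

        Composite add(String table, Format format) {
            delegates.put(table, format);
            return this;
        }

        public List<Split> getSplits() {
            List<Split> all = new ArrayList<>();
            for (Format format : delegates.values()) {
                all.addAll(format.getSplits());
            }
            return all;
        }

        public String createRecordReader(Split split) {
            Format format = delegates.get(split.table);
            if (format == null) {
                throw new IllegalArgumentException(
                        "No delegate for table " + split.table);
            }
            return format.createRecordReader(split);
        }
    }

    public static void main(String[] args) {
        Composite both = new Composite()
                .add("A", new SingleTable("A", "", "m"))
                .add("B", new SingleTable("B", ""));
        for (Split split : both.getSplits()) {
            System.out.println(both.createRecordReader(split));
        }
    }
}
```

Keying the delegates by table name, rather than hard-coding an if/else for
two tables, means the same composite would work for any number of tables. In
a real job the split class would also have to be serializable by the
framework, which this sketch does not attempt.
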

You also need to change the way TableMapReduceUtil sets up the map job, so
write your own CustomTableMapReduceUtil.


Hope this helps in some way!

Thanks

V.Bharath
Btech-3rd year
IIIT-Hyderabad

On Mon, Jan 11, 2010 at 3:05 AM, Mridul Muralidharan
<mridulm@yahoo-inc.com> wrote:

>
> Unfortunately I can't directly share our code, but for an example you
> can look at MultipleInputs and/or DelegatingInputFormat in Hadoop.
>
> If you want a more sophisticated example, please take a look at the Pig
> subproject in Hadoop (though its code might be a bit too complicated to
> investigate for this simple use case).
>
>
> Regards,
> Mridul
>
> john smith wrote:
>
>> Mridul ,
>>
>> It seems feasible, but I am not 100% clear. Can you please show us
>> your implementation in Hadoop so that we can get some idea and implement
>> the same for HBase? Thanks for your help.
>>
>> J-S
>>
>> On Sat, Jan 9, 2010 at 12:26 AM, Mridul Muralidharan
>> <mridulm@yahoo-inc.com>wrote:
>>
>>  Hi,
>>>
>>>
>>> This is assuming there is no easier way to do it (someone from hbase team
>>> can comment better !).
>>>
>>> But the usual way to handle this for mapreduce is to create a composite
>>> input format, which delegates to the underlying formats to generate the
>>> splits and then the corresponding record readers based on the split.
>>>
>>>
>>> I have not done this for hbase though - but looking at
>>> TableInputFormatBase, it looks possible to implement ...
>>>
>>> Specifically for hbase, something along the lines of :
>>>
>>> --- start dirty pseudo code ---
>>>
>>> CustomTableInputFormat extends TableInputFormatBase and implements
>>> setConf() to configure the table(s) required.
>>>
>>> public class CustomTableInputFormat extends
>>> InputFormat<ImmutableBytesWritable, Result> {
>>>
>>>  private CustomTableInputFormat delegate1;
>>>  private CustomTableInputFormat delegate2;
>>>
>>>  public void setConf(){
>>>   delegate1 = createTable1InputFormat();
>>>   delegate2 = createTable2InputFormat();
>>>  }
>>>
>>>  public List<InputSplit> getSplits(JobContext context) throws IOException
>>> {
>>>   // Concatenate the splits of both tables.
>>>   List<InputSplit> retval = new LinkedList<InputSplit>();
>>>   retval.addAll(delegate1.getSplits(context));
>>>   retval.addAll(delegate2.getSplits(context));
>>>   return retval;
>>>  }
>>>
>>>
>>>  public RecordReader<ImmutableBytesWritable, Result>
>>>   createRecordReader(InputSplit split,
>>>                      TaskAttemptContext context
>>>                     ) throws IOException,
>>>                              InterruptedException {
>>>   // Hand each split back to the delegate that produced it.
>>>   if (split for table1)
>>>     return delegate1.createRecordReader(split, context);
>>>   else if (split for table2)
>>>     return delegate2.createRecordReader(split, context);
>>>   else throw exception
>>>  }
>>>
>>> }
>>>
>>> --- end pseudo code ---
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>> john smith wrote:
>>>
>>>  Mridul
>>>>
>>>> Can you be more clear .. I didn't get you !
>>>>
>>>> On Fri, Jan 8, 2010 at 6:13 PM, Mridul Muralidharan
>>>> <mridulm@yahoo-inc.com>wrote:
>>>>
>>>>
>>>>> If you just want to scan both tables for your mapper, assuming there is
>>>>> no easier way to do it, can't you write a composite input format which
>>>>> delegates to both tables' input formats?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>> john smith wrote:
>>>>>
>>>>>> Stack,
>>>>>>
>>>>>> The requirement is that I need to scan two tables, A and B, for an MR
>>>>>> job. Order is not important. That is, the reduce phase contains keys
>>>>>> from both A and B.
>>>>>>
>>>>>> Presently I am using TableMap for "A", and in one of the mappers I am
>>>>>> reading the entire B using a scanner. But this is a big overhead,
>>>>>> right? Non-local B data will be transferred (over the network) to the
>>>>>> machine executing that map phase. Instead, I was thinking there might
>>>>>> be some variant of TableMap which scans both A and B and emits the
>>>>>> corresponding keys. Order is not at all important, and there are no
>>>>>> random lookups. I need the entire B table's keys in some way or the
>>>>>> other with the least overhead!
>>>>>>
>>>>>> There is also one more solution I was thinking of. Suppose I am
>>>>>> scanning some particular region using TableMap. I can get that
>>>>>> region's name using some function in the API, then build a scanner on
>>>>>> B over that particular region and emit all the keys from B. This
>>>>>> doesn't require any network transfer of data. Is this solution
>>>>>> feasible? If yes, any hints on what classes to use from the API?
>>>>>>
>>>>>> Thanks ,
>>>>>> J-S
>>>>>>
>>>>>> On Fri, Jan 8, 2010 at 10:46 AM, stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>> This is a little tough.  Do both tables have the same number of
>>>>>>> regions?  Are you walking through the two tables serially in your
>>>>>>> mapreduce, or do you want to do random lookups into the second table
>>>>>>> dependent on the row you are currently processing in table one?
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 7, 2010 at 7:51 PM, john smith <js1987.smith@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> My requirement is that I must read two tables (belonging to the same
>>>>>>>> region server) in the same map.
>>>>>>>>
>>>>>>>> Normally TableMap supports only one table at a time, and right now I
>>>>>>>> am reading the entire second table in one of the maps. This is a big
>>>>>>>> overhead. Can anyone suggest some modification of TableMap, or a
>>>>>>>> different approach, which can read two tables at the same time?
>>>>>>>> This would be very useful to us!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> J-S
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>
