hbase-user mailing list archives

From Mridul Muralidharan <mrid...@yahoo-inc.com>
Subject Re: MR in HBase
Date Sun, 10 Jan 2010 21:35:58 GMT

Unfortunately I can't directly share our code ... but for an example, you
can look at MultipleInputs and/or DelegatingInputFormat in hadoop.

If you want a more sophisticated example, please take a look at the pig
subproject in hadoop (though its code might be a bit too complicated to
dig through for this simple use case).

Regards,
Mridul

john smith wrote:
> Mridul,
> 
> It seems feasible, but I am not 100% clear. Can you please show us
> your implementation in hadoop so that we can get some idea and implement the
> same for HBase? Thanks for your help.
> 
> J-S
> 
> On Sat, Jan 9, 2010 at 12:26 AM, Mridul Muralidharan
> <mridulm@yahoo-inc.com> wrote:
> 
>> Hi,
>>
>>
>> This is assuming there is no easier way to do it (someone from the hbase
>> team can comment better!).
>>
>> But the usual way to handle this for mapreduce is to create a composite
>> input format, which delegates to the underlying formats to generate the
>> splits and then creates the corresponding record reader based on the split.
>>
>>
>> I have not done this for hbase though - but looking at
>> TableInputFormatBase, it looks possible to implement ...
>>
>> Specifically for hbase, something along the lines of:
>>
>> --- start dirty pseudo code ---
>>
>> CustomTableInputFormat extends TableInputFormatBase and implements
>> setConf() to configure the table required. The composite below holds
>> one such delegate per table.
>>
>> public class CompositeTableInputFormat extends
>> InputFormat<ImmutableBytesWritable, Result> {
>>
>>  private CustomTableInputFormat delegate1;
>>  private CustomTableInputFormat delegate2;
>>
>>  public void setConf(Configuration conf) {
>>    delegate1 = createTable1InputFormat();
>>    delegate2 = createTable2InputFormat();
>>  }
>>
>>  public List<InputSplit> getSplits(JobContext context) throws IOException {
>>    List<InputSplit> retval = new LinkedList<InputSplit>();
>>    // splits from both tables go into one list
>>    retval.addAll(delegate1.getSplits(context));
>>    retval.addAll(delegate2.getSplits(context));
>>    return retval;
>>  }
>>
>>
>>  public RecordReader<ImmutableBytesWritable, Result>
>>    createRecordReader(InputSplit split,
>>                       TaskAttemptContext context
>>                      ) throws IOException,
>>                               InterruptedException {
>>    // route each split back to the delegate that produced it
>>    if (split for table1) return delegate1.createRecordReader(split, context);
>>    else if (split for table2) return delegate2.createRecordReader(split, context);
>>    else throw exception
>>  }
>>
>> }
>>
>> --- end pseudo code ---
>>
>>
>> Regards,
>> Mridul
>>
>> john smith wrote:
>>
>>> Mridul
>>>
>>> Can you be more clear .. I didn't get you !
>>>
>>> On Fri, Jan 8, 2010 at 6:13 PM, Mridul Muralidharan
>>> <mridulm@yahoo-inc.com> wrote:
>>>
>>>
>>>> If you just want to scan both tables for your mapper, assuming there is
>>>> no easier way to do it - can you not write a composite input format which
>>>> delegates to both tables' input formats?
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>> john smith wrote:
>>>>
>>>>> Stack,
>>>>>
>>>>> The requirement is that I need to scan two tables A and B for an MR
>>>>> job. Order is not important. That is, the reduce phase contains keys
>>>>> from both A and B.
>>>>>
>>>>> Presently I am using TableMap for "A", and in one of the mappers I am
>>>>> reading the entire B using a scanner. But this is a big overhead,
>>>>> right? Non-local B data will be transferred (over the network) to the
>>>>> machine executing that map phase. Instead, what I was thinking of is
>>>>> some kind of variant of TableMap which scans both A and B and emits
>>>>> the corresponding keys. Order is not at all important, and there are
>>>>> no random lookups. I need the entire B table's keys in some way or
>>>>> the other with the least overhead!
>>>>>
>>>>> Also, there is one more solution I was thinking of. Suppose I am
>>>>> scanning some particular region using TableMap. I can get that
>>>>> region's name using some function in the API, then I can build a
>>>>> scanner on B over that particular region and emit all the keys from
>>>>> B. This doesn't require any network transfer of data. Is this
>>>>> solution feasible? If yes, any hints on what classes to use from the
>>>>> API?
>>>>>
>>>>> Thanks ,
>>>>> J-S
>>>>>
>>>>> On Fri, Jan 8, 2010 at 10:46 AM, stack <stack@duboce.net> wrote:
>>>>>
>>>>>> This is a little tough.  Do both tables have same number of regions?
>>>>>> Are you walking through the two tables serially in your mapreduce or
>>>>>> do you want to do random lookups into the second table dependent on
>>>>>> the row you are currently processing in table one?
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 7, 2010 at 7:51 PM, john smith <js1987.smith@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> My requirement is that I must read two tables (belonging to the same
>>>>>>> region server) in the same Map.
>>>>>>>
>>>>>>> Normally TableMap supports only 1 table at a time, and right now I am
>>>>>>> reading the entire 2nd table in any one of the maps. This is a big
>>>>>>> overhead. So can anyone suggest some modification of TableMap or a
>>>>>>> different approach which can read 2 tables at the same time? This
>>>>>>> can be very useful to us!
>>>>>>>
>>>>>>> Thanks
>>>>>>> J-S
>>>>>>>
>>>>>>>
>>>>>>>
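
The composite input format idea discussed in this thread can be sketched in
plain Java, without Hadoop or HBase on the classpath. This is a minimal
illustration under stated assumptions, not the poster's implementation:
Split, SingleTableFormat and CompositeFormat are hypothetical stand-ins for
an InputSplit, a per-table input format and the composite format. Each split
carries the name of the table that produced it, which is one way to make the
"split for table1" test in the pseudo code concrete.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-ins: none of these types are Hadoop or HBase classes.
class CompositeFormatSketch {

    /** Stand-in for an input split, tagged with the table that produced it. */
    static final class Split {
        final String table;
        final String startRow;
        final String endRow;

        Split(String table, String startRow, String endRow) {
            this.table = table;
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /** Stand-in for a single-table input format. */
    static final class SingleTableFormat {
        final String table;
        final String[] boundaries; // region boundaries, e.g. {"", "m", ""}

        SingleTableFormat(String table, String... boundaries) {
            this.table = table;
            this.boundaries = boundaries;
        }

        /** One split per adjacent pair of boundaries. */
        List<Split> getSplits() {
            List<Split> splits = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.length; i++) {
                splits.add(new Split(table, boundaries[i], boundaries[i + 1]));
            }
            return splits;
        }

        /** Returns a description of the reader rather than a real RecordReader. */
        String createReader(Split split) {
            return "reader(" + table + ", [" + split.startRow + ", " + split.endRow + "))";
        }
    }

    /** Composite format: concatenates the delegates' splits and routes by table tag. */
    static final class CompositeFormat {
        final List<SingleTableFormat> delegates = new ArrayList<>();

        CompositeFormat(SingleTableFormat... formats) {
            for (SingleTableFormat f : formats) {
                delegates.add(f);
            }
        }

        /** Splits from every delegate end up in one list, in delegate order. */
        List<Split> getSplits() {
            List<Split> all = new ArrayList<>();
            for (SingleTableFormat d : delegates) {
                all.addAll(d.getSplits());
            }
            return all;
        }

        /** Hands the split back to the delegate whose table name matches its tag. */
        String createReader(Split split) {
            for (SingleTableFormat d : delegates) {
                if (d.table.equals(split.table)) {
                    return d.createReader(split);
                }
            }
            throw new IllegalArgumentException("No delegate for table " + split.table);
        }
    }

    public static void main(String[] args) {
        CompositeFormat composite = new CompositeFormat(
            new SingleTableFormat("A", "", "m", ""),
            new SingleTableFormat("B", "", "g", "t", ""));
        for (Split s : composite.getSplits()) {
            System.out.println(composite.createReader(s));
        }
    }
}
```

In a real job the same routing would be applied to the actual InputSplit
objects handed to createRecordReader(); whether and how a split exposes its
table name depends on the HBase version in use, so that part is left as an
assumption here.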

