hbase-user mailing list archives

From bharath v <bharathvissapragada1...@gmail.com>
Subject Re: Question about MapReduce
Date Mon, 19 Oct 2009 17:23:15 GMT
Thanks for replying, JG. I have posted some questions inline.

On Mon, Oct 19, 2009 at 10:36 PM, Jonathan Gray <jlist@streamy.com> wrote:

> Are you currently being limited by network throughput?  I wouldn't become
> obsessed with data locality until it becomes the bottleneck.
>

I was thinking that this method might be far more efficient (not sure, just
guessing) than the brute-force method, where we read the entire table2
inside a table1 mapper. I want to check the performance of both approaches.
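
To make the comparison concrete, here is a rough sketch of how many table2 rows each approach would read. It uses plain Python data rather than the HBase API, and the region layout, host names, and row counts are all invented for illustration.

```python
# Hypothetical layout: each region is (table, host, row_count). Not real HBase data.
regions = [
    ("table1", "host1", 1000),
    ("table1", "host2", 1000),
    ("table2", "host1", 500),
    ("table2", "host2", 700),
    ("table2", "host3", 300),
]


def brute_force_reads(regions):
    """Rows of table2 read when every table1 mapper scans all of table2."""
    mappers = [r for r in regions if r[0] == "table1"]
    table2_rows = sum(r[2] for r in regions if r[0] == "table2")
    return len(mappers) * table2_rows


def local_only_reads(regions):
    """Rows of table2 read when every table1 mapper scans only the
    table2 regions hosted on its own node."""
    total = 0
    for table, host, _ in regions:
        if table != "table1":
            continue
        total += sum(r[2] for r in regions if r[0] == "table2" and r[1] == host)
    return total


print(brute_force_reads(regions))
print(local_only_reads(regions))
```

In this made-up layout the local-only variant never reads the table2 region on host3, so the two numbers compare read volume only, not whether the join sees all of table2.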


>
> Even the naive implementation of this would not be entirely simple... but
> then what do you do if the regions on that node changed during the course of
> the map (splits, reassigns, etc)?
>

Since we can scan .META. to get the start and end keys of a particular region
and build scanners for them, I thought it would be easy. Any hint as to why
it can become complex?
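
The split/reassign concern quoted above can be sketched with a toy model. This is not the HBase client API: the region lists below are hypothetical stand-ins for what a .META. scan might return, and the keys and host names are made up.

```python
from bisect import bisect_right

# Hypothetical snapshot of table2's regions taken when the map starts:
# sorted (start_key, host) pairs; each region runs up to the next start key.
snapshot = [("", "host1"), ("g", "host2"), ("p", "host3")]


def local_ranges(region_list, host):
    """Return (start, end) key ranges served by `host`; end=None means end of table."""
    ranges = []
    for i, (start, h) in enumerate(region_list):
        if h == host:
            end = region_list[i + 1][0] if i + 1 < len(region_list) else None
            ranges.append((start, end))
    return ranges


def host_for(region_list, row_key):
    """Find which host serves `row_key` according to `region_list`."""
    starts = [s for s, _ in region_list]
    return region_list[bisect_right(starts, row_key) - 1][1]


# Ranges the mapper on host1 plans to scan, based on the snapshot.
planned = local_ranges(snapshot, "host1")

# Assumed event during the map: the first region splits at "c" and the
# upper half is reassigned to another node.
current = [("", "host1"), ("c", "host4"), ("g", "host2"), ("p", "host3")]

# Planned ranges that no longer match a region on host1.
stale = [r for r in planned if r not in local_ranges(current, "host1")]

print(planned)
print(stale)
print(host_for(snapshot, "d"), host_for(current, "d"))
```

The boundaries read at the start of the map are only a snapshot, so a mapper that relies on them for locality would have to detect this kind of change and re-plan.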

>
> I would imagine you'll have other things to optimize well before network
> throughput becomes an issue.  And if you do go down the route of this kind
> of (potential) hyper-optimization, you'll need to be aware of the hardware
> you're using and the performance impact of different approaches.  If you
> only have a single disk, then concurrent scans of two different tables can
> cause disk contention, etc...


> Are you joining 2 tables by matching row key to row key?  If so, then this
> sounds like 2 tables that should be 1 table with multiple families (that's
> really the value in multiple families... each family is really like a
> separate table, but they are easily joined together by row key).
>

I wanted to implement a join of 2 tables based on any column family
(somewhat similar to a database join).
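
For reference, a join on an arbitrary column is often expressed in MapReduce as a reduce-side join: the map emits the join value as the key, tagged with the source table, and the reduce pairs up rows that share a key. The sketch below is a generic in-memory simulation of that idea, not something proposed in this thread and not HBase code; the table contents and column names are invented.

```python
from collections import defaultdict

# Hypothetical rows: row key -> {"family:qualifier": value}
table1 = {"u1": {"info:city": "Paris"}, "u2": {"info:city": "Oslo"}}
table2 = {
    "c1": {"geo:city": "Paris", "geo:country": "FR"},
    "c2": {"geo:city": "Rome", "geo:country": "IT"},
}


def map_phase(tag, table, join_column):
    """Emit (join value, (tag, row key, row)) for each row that has the join column."""
    for row_key, row in table.items():
        if join_column in row:
            yield row[join_column], (tag, row_key, row)


def reduce_phase(pairs):
    """Group by join value, then pair every table1 row with every table2 row (inner join)."""
    groups = defaultdict(lambda: {"t1": [], "t2": []})
    for key, (tag, row_key, row) in pairs:
        groups[key][tag].append((row_key, row))
    joined = []
    for key in sorted(groups):
        for k1, r1 in groups[key]["t1"]:
            for k2, r2 in groups[key]["t2"]:
                joined.append((key, k1, k2, {**r1, **r2}))
    return joined


pairs = list(map_phase("t1", table1, "info:city"))
pairs += list(map_phase("t2", table2, "geo:city"))
result = reduce_phase(pairs)
print(result)
```

Only "Paris" appears in both tables here, so the inner join yields a single combined row; the shuffle between map and reduce is what moves matching rows together, regardless of where the regions live.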




>  JG
>
>
> bharath v wrote:
>
>> Kevin: What if I want to implement a join of 2 tables? Is there an
>> alternative to TableInputFormat (TIF), since it reads a single table at a
>> time? I thought of a solution, but I am not sure whether it works.
>>
>> Suppose we want to join table1 and table2, and we use TIF on table1. The
>> map phase is as follows.
>>
>> Map:
>>
>> Suppose the TIF is reading region1 of table1. Then we can IN SOME WAY
>> get the start and end keys of the table2 regions (if any) on the system
>> where the map is being executed, and read those table2 contents in the
>> map. This in some way preserves DATA LOCALITY.
>>
>> Is this feasible? Any comments?
>>
>>
>>
>> On Fri, Oct 16, 2009 at 12:09 AM, Kevin Peterson <kpeterson@biz360.com> wrote:
>>
>>> On Thu, Oct 15, 2009 at 11:30 AM, Something Something <luckyguy2050@yahoo.com> wrote:
>>>
>>>> 1) I don't think TableInputFormat is useful in this case.  Looks like it's
>>>> used for scanning columns from a single HTable.
>>>> 2) TableMapReduceUtil - same problem.  Seems like this works with just one
>>>> table.
>>>> 3) JV recommended NLineInputFormat, but my parameters are not in a file.
>>>>  They come from multiple files and are in memory.
>>>>
>>>> I guess what I am looking for is something like... InMemoryInputFormat...
>>>> similar to FileInputFormat & DbInputFormat.  There's no such class right
>>>> now.
>>>>
>>>> Worse comes to worst, I can write the parameters into a flat file, and use
>>>> FileInputFormat - but that will slow down this process considerably.  Is
>>>> there no other way?
>>>
>>> So you need to pull input from multiple tables at once? Are you expecting
>>> to do a join on these tables? If you explain what the data looks like, we'd
>>> understand better. What are your tables, and what would you like to treat as
>>> a single input record?
>>>
>>>
>>
