hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: Question about MapReduce
Date Mon, 19 Oct 2009 17:06:14 GMT
Are you currently being limited by network throughput?  I wouldn't 
become obsessed with data locality until it becomes the bottleneck.

Even a naive implementation of this would not be entirely simple, and 
then what do you do if the regions on that node change during the 
course of the map (splits, reassignments, etc.)?
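
To make the naive lookup concrete, here is a minimal sketch in plain Python rather than the HBase API. It assumes you have somehow obtained a snapshot of table2's regions as (start key, end key, host) tuples; the names Region, overlaps and local_overlapping_regions are hypothetical. It follows the HBase convention that an empty start or end key means the table boundary.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: str  # inclusive start key; "" means start of table
    end: str    # exclusive end key; "" means end of table
    host: str   # server hosting the region when the snapshot was taken


def overlaps(a: Region, b: Region) -> bool:
    """True if the key ranges [a.start, a.end) and [b.start, b.end) intersect."""
    a_starts_before_b_ends = b.end == "" or a.start < b.end
    b_starts_before_a_ends = a.end == "" or b.start < a.end
    return a_starts_before_b_ends and b_starts_before_a_ends


def local_overlapping_regions(split: Region, other: List[Region]) -> List[Region]:
    """Regions of the other table on the split's host that share keys with it."""
    return [r for r in other if r.host == split.host and overlaps(split, r)]
```

Even in this toy form, a table1 region covering g..n may overlap a local table2 region covering only part of that range, so the rest still has to be read remotely. The snapshot can also go stale as soon as a region splits or is reassigned, which is the problem raised above.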

I would imagine you'll have other things to optimize well before network 
throughput becomes an issue.  And if you do go down the route of this 
kind of (potential) hyper-optimization, you'll need to be aware of the 
hardware you're using and the performance impact of different 
approaches.  If you only have a single disk, then concurrent scans of 
two different tables can cause disk contention, etc...

Are you joining 2 tables by matching row key to row key?  If so, then 
this sounds like 2 tables that should be 1 table with multiple families 
(that's really the value in multiple families... each family is really 
like a separate table, but they are easily joined together by row key).
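
As a rough illustration of that point, the sketch below models tables as plain Python dicts (this is not HBase client code, and every name in it is hypothetical). join_two_tables does a merge join over two tables visited in row-key order, while scan_one_table gets the same result from a single pass over one table whose rows carry both families.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def join_two_tables(table1: Dict[str, Row], table2: Dict[str, Row]) -> List[Tuple]:
    """Merge join of two tables, each walked in row-key order like a scan."""
    keys1, keys2 = sorted(table1), sorted(table2)
    i = j = 0
    joined = []
    while i < len(keys1) and j < len(keys2):
        if keys1[i] == keys2[j]:
            joined.append((keys1[i], table1[keys1[i]], table2[keys2[j]]))
            i += 1
            j += 1
        elif keys1[i] < keys2[j]:
            i += 1  # key only in table1, skip it
        else:
            j += 1  # key only in table2, skip it
    return joined


def scan_one_table(table: Dict[str, Dict[str, Row]], families: List[str]) -> List[Tuple]:
    """One pass over a table with families; keep rows that have every family."""
    return [
        (key, *(table[key][f] for f in families))
        for key in sorted(table)
        if all(f in table[key] for f in families)
    ]
```

With the data in one table, the "join" is just a scan that asks for both families, so there is no second table to locate or keep in step.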


bharath v wrote:
> Kevin: What if I want to implement a join of 2 tables? Is there an
> alternative to TableInputFormat (TIF), because it reads a single table at a
> time? I thought of a solution, but I am not sure whether it works.
> Suppose we want to join table1 and table2, and we use TIF on table1 and the
> Map phase is as follows.
> Map:
> Suppose the TIF is reading region1 of table1. Then we can IN SOME WAY
> get the regions' start and end keys corresponding to table2 on that
> system (if any) where the map is being executed
> and read the table2 contents in the Map. This is in some way preserving
> Is this feasible? Any comments?
> On Fri, Oct 16, 2009 at 12:09 AM, Kevin Peterson <kpeterson@biz360.com> wrote:
>> On Thu, Oct 15, 2009 at 11:30 AM, Something Something <
>> luckyguy2050@yahoo.com> wrote:
>>> 1) I don't think TableInputFormat is useful in this case.  Looks like it's
>>> used for scanning columns from a single HTable.
>>> 2) TableMapReduceUtil - same problem.  Seems like this works with just one
>>> table.
>>> 3) JV recommended NLineInputFormat, but my parameters are not in a file.
>>>  They come from multiple files and are in memory.
>>> I guess what I am looking for is something like... InMemoryInputFormat...
>>> similar to FileInputFormat & DbInputFormat.  There's no such class right
>>> now.
>>> Worse comes to worst, I can write the parameters into a flat file, and use
>>> FileInputFormat - but that will slow down this process considerably.  Is
>>> there no other way?
>> So you need to pull input from multiple tables at once? Are you expecting
>> to do a join on these tables? If you explain what the data looks like, we'd
>> understand better. What are your tables, and what would you like to treat as
>> a single input record?
