hadoop-common-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Linear scalability question
Date Thu, 09 Jun 2011 14:23:11 GMT
Shantian,

You are correct.  The other big factor is the cost of connections between the mappers
and the reducers.  With N mappers and M reducers you will make M*N connections between them,
which can be a very large cost as well.  The basic tricks you can play are to filter data
before doing a join so that less data passes over the network; make your block sizes
larger so that more data is processed per map and fewer connections are made between mappers
and reducers; and make sure compression is enabled between the map and reduce phases (LZO
is usually a good choice for this).  For the most part, though, this is a very difficult
problem to address, because a join fundamentally requires transferring all data with the same
key to the same node.  There is some experimental work on very large-scale processing, but it
is still just that, experimental, and it does not actually try to make a join scale linearly.
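
To put rough numbers on the M*N point, here is a minimal back-of-the-envelope sketch in
Python.  It assumes one map task per input block and a hypothetical 500 reducers; the 10 TB
input size comes from the original question, and the block sizes are only illustrative.
The function name is made up for this example, and real jobs will differ (split sizes,
combiners, and fetch batching all change the picture).

```python
TB = 1024 ** 4
MB = 1024 ** 2

def shuffle_connections(input_bytes, block_bytes, num_reducers):
    """Estimate (num_mappers, mapper-to-reducer connections).

    Assumption: exactly one map task per block, and every reducer
    fetches output from every mapper (N * M connections).
    """
    # Integer ceiling division, so a partial final block still gets a mapper.
    num_mappers = -(-input_bytes // block_bytes)
    return num_mappers, num_mappers * num_reducers

if __name__ == "__main__":
    reducers = 500  # hypothetical reducer count
    for block_mb in (64, 128, 256, 512):
        mappers, conns = shuffle_connections(10 * TB, block_mb * MB, reducers)
        print(f"{block_mb:>4} MB blocks: {mappers:>7} mappers, {conns:>10} connections")
```

Under these assumptions, each doubling of the block size halves both the number of mappers
and the number of shuffle connections.  The total bytes moved for the join stay the same, so
this reduces connection overhead but not the bandwidth needed.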

--Bobby Evans

On 6/8/11 4:17 PM, "Shantian Purkad" <shantian_purkad@yahoo.com> wrote:

any comments?



________________________________
From: Shantian Purkad <shantian_purkad@yahoo.com>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Sent: Tuesday, June 7, 2011 3:53 PM
Subject: Linear scalability question

Hi,

I have a question on the linear scalability of Hadoop.

We have a situation where we have to do reduce-side joins on two big tables (10+ TB). This
causes a lot of data to be transferred over the network, and the network is becoming a bottleneck.

In a few years these tables will hold 100+ TB of data, and the reduce-side joins will demand
a lot of data transfer over the network. Since network bandwidth is limited and cannot be
increased by adding more nodes, Hadoop will no longer be linearly scalable in this case.

Is my understanding correct? Am I missing anything here? How do people address these kinds
of bottlenecks?

Thanks and Regards,
Shantian

