hadoop-common-dev mailing list archives

From "Owen O'Malley" <o...@yahoo-inc.com>
Subject Re: silly question: why http for map output?
Date Thu, 01 Jun 2006 16:04:57 GMT

On Jun 1, 2006, at 5:06 AM, Stefan Groschupf wrote:

> Hi Owen, Hi All,
> a silly question, please give me a clue.
> Why do we now use http for map output transfer instead of tcp or the dfs 
> itself?
> Sorry, but the issue HADOOP-254 doesn't give very much information, just 
> that it is faster, which surprises me a little bit.

It is a good question. My first thought was to use a mini-ftp server, 
but http was fast, standard, and desired for the task trackers anyways. 
Once jetty was running in the task tracker, http was by far the 
easiest way to get the new protocol up and running smoothly.

In terms of DFS, it should be doable, but the performance would suffer. 
You could make it better by setting the replication on the files to 1 
to minimize the costs of the replication. But there are a lot (M*R) of 
little files that are flying around the system. Little files really 
aren't the space where DFS shines. For my 200-node sorter, I'm running 
with 16080 maps * 700 reduces = 11,256,000 little files. Clearly there 
are changes to the framework, such as writing a single file from each 
map that is partitioned by reduce, that would help, but we don't have 
those yet.

-- Owen
