hadoop-common-user mailing list archives

From Matthew Foley <ma...@yahoo-inc.com>
Subject Re: Sanity check re: value of 10GbE NICs for Hadoop?
Date Tue, 28 Jun 2011 19:04:17 GMT
Hadoop common provides an abstract FileSystem class, and Hadoop applications 
should be designed to run on that.  HDFS is just one implementation of a valid 
Hadoop filesystem, and ports to S3 and KFS as well as OS-supported LocalFileSystem
are provided in Hadoop common.  Use of NFS-mounted storage would fall under the 
LocalFileSystem model.
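
As a rough illustration of the point above (an analogy only, not Hadoop's actual Java API), the sketch below shows application code written against an abstract filesystem interface, with a local-file implementation plugged in. The names `AbstractFS`, `LocalFS`, `count_lines`, and `demo` are hypothetical and chosen just for this example.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class AbstractFS(ABC):
    """Hypothetical stand-in for the role an abstract filesystem class plays."""

    @abstractmethod
    def open(self, path):
        """Return a readable binary stream for the given path."""


class LocalFS(AbstractFS):
    """Backed by the OS file API; an NFS mount would look the same to it."""

    def open(self, path):
        # Inside the method body, open() resolves to the Python built-in.
        return open(path, "rb")


def count_lines(fs, path):
    """Application code: depends only on the abstract interface."""
    with fs.open(path) as stream:
        return sum(1 for _ in stream)


def demo():
    """Write a three-line file to a temp directory and count its lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w") as f:
            f.write("a\nb\nc\n")
        return count_lines(LocalFS(), path)
```

Swapping in another implementation of `AbstractFS` would leave `count_lines` unchanged, which is the property the paragraph above relies on.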

However, one of the core values of Hadoop is the model of "bring the computation
to the data".  This does not seem viable with an NFS-based NAS-model storage
subsystem.  Thus, while it will "work" for small clusters and small jobs, it is unlikely
to scale with high performance to thousands of nodes and petabytes of data in the 
way Hadoop can scale with HDFS or S3.
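
A back-of-the-envelope sketch of why the NAS model is unlikely to scale: with node-local storage, aggregate read throughput grows with the node count, while a shared NAS is capped by whatever its head can serve. All figures below (100 MB/s of local disk per node, a 10GbE NIC at about 1250 MB/s, a 5000 MB/s NAS head) are assumptions picked for illustration, not measurements.

```python
def local_storage_throughput(nodes, disk_mb_s_per_node):
    """Aggregate read rate when each task reads from its own node's disks."""
    return nodes * disk_mb_s_per_node


def nas_throughput(nodes, nic_mb_s_per_node, nas_head_mb_s):
    """Aggregate read rate when every node pulls from one shared NAS.

    The cluster cannot read faster than the NAS can serve, no matter how
    fast each node's NIC is.
    """
    return min(nodes * nic_mb_s_per_node, nas_head_mb_s)


def compare(node_counts, disk_mb_s=100, nic_mb_s=1250, nas_head_mb_s=5000):
    """Return (nodes, local MB/s, NAS MB/s) rows for the assumed figures."""
    return [
        (n,
         local_storage_throughput(n, disk_mb_s),
         nas_throughput(n, nic_mb_s, nas_head_mb_s))
        for n in node_counts
    ]


if __name__ == "__main__":
    for nodes, local, nas in compare([10, 100, 1000]):
        print(f"{nodes:5d} nodes: local {local:7d} MB/s, NAS {nas:5d} MB/s")
```

Under these assumed numbers the NAS figure stops growing once a handful of nodes saturate the head, while the local-storage figure keeps scaling linearly.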

--Matt


On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:

I see. However, Hadoop is designed to operate best with HDFS because
of its inherent striping and blocking strategy - which is tracked by Hadoop.
Going outside of that mechanism will probably yield poor results and/or
confuse Hadoop.
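
To make the blocking idea concrete, here is a simplified sketch: a file is cut into fixed-size blocks, each block is replicated on several nodes, and a task is preferably run on a node that already holds its block. The 64 MB block size and replication factor of 3 are, as far as I know, the HDFS defaults of this era; the round-robin placement is a deliberate simplification and not HDFS's real placement policy, and all function names are hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # assumed default block size (64 MB)
REPLICATION = 3                 # assumed default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    count = math.ceil(file_size / block_size)
    return [
        (i * block_size, min(block_size, file_size - i * block_size))
        for i in range(count)
    ]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Map block id -> list of nodes holding a replica (round-robin toy)."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def pick_local_node(block_id, placement, idle_nodes):
    """Prefer an idle node that already stores the block; None if none is."""
    for node in placement[block_id]:
        if node in idle_nodes:
            return node
    return None
```

A storage layer that hides block locations (as a generic NFS mount would) leaves `pick_local_node` with nothing to work from, which is one way to read the concern above.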

Just my thoughts.

On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
> Darren,
> Thanks. The last point was basically about 10GbE potentially allowing the
> use of a network file system, e.g. via NFS, as an alternative to HDFS. The
> question is whether there is any merit in this. Basically, I was exploring
> whether commercial clustered NAS products offer any high-availability or
> data-management benefits for use with Hadoop.
> 
> Saqib
> 
> -----Original Message-----
> From: Darren Govoni [mailto:darren@ontrenet.com]
> Sent: Tuesday, June 28, 2011 10:21 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
> 
> Hadoop, like other parallel networked computation architectures, is
> predominantly I/O bound.
> This means any increase in network bandwidth is "A Good Thing" and can have
> drastic positive effects on performance. All your points stem from this
> simple realization.
> 
> Although I'm confused by your #6. Hadoop already uses a distributed file
> system: HDFS.
> 
> On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
>> Folks,
>>
>> I've been digging into the potential benefits of using 10 Gigabit
>> Ethernet (10GbE) NIC server connections for Hadoop and wanted to run
>> what I've come up with through initial research by the list for
>> 'sanity check' feedback. I'd very much appreciate your input on the
>> importance (or lack of it) of the following potential benefits of
>> 10GbE server connectivity, as well as other thoughts regarding 10GbE
>> and Hadoop. (My interest is specifically in the value of 10GbE server
>> connections and 10GbE switching infrastructure over scenarios such as
>> bonded 1GbE server connections with 10GbE switching.)
>>
>> 1. HDFS Data Loading. The higher throughput enabled by 10GbE server
>>    and switching infrastructure allows faster processing and
>>    distribution of data.
>> 2. Hadoop Cluster Scalability. High performance for initial data
>>    processing and distribution directly impacts the degree of
>>    parallelism or scalability supported by the cluster.
>> 3. HDFS Replication. Higher-speed server connections allow faster
>>    file replication.
>> 4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and
>>    latency directly impact the shuffle phase of a data set reduction,
>>    especially for tasks that are at the document level (including
>>    large documents) and lots of metadata generated by those
>>    documents, as well as video analytics and images.
>> 5. Data Reporting. 10GbE server network performance can improve data
>>    reporting performance, especially if the Hadoop cluster is running
>>    multiple data reductions.
>> 6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be
>>    reorganized to use a cluster or network file system. This would
>>    allow Hadoop, even with its Java implementation, to have
>>    higher-performance I/O and not have to be so concerned with disk
>>    drive density in the same server.
>> 7. Others?
>>
>> thanks,
>>
>> Saqib
>>
>> Saqib Jang
>> Principal/Founder
>> Margalla Communications, Inc.
>> 1339 Portola Road, Woodside, CA 94062
>> (650) 274 8745
>> www.margallacomm.com
> 
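
For the data-loading and replication points in the quoted list, a rough line-rate comparison can be sketched as follows. It assumes ideal links (no protocol overhead, disks that keep up with the network) and treats a 2 x 1GbE bond as a perfect 2 Gbit/s pipe, which is optimistic since a single flow on a bonded link often gets only one member's bandwidth. The 1000 GB data size is an arbitrary example figure.

```python
def transfer_seconds(data_gb, link_gbit_s, efficiency=1.0):
    """Time to move data_gb gigabytes (decimal) over a link.

    link_gbit_s is the nominal line rate in Gbit/s; efficiency (0-1] is an
    assumed fraction of that rate actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_gb * 8e9
    return bits / (link_gbit_s * 1e9 * efficiency)


if __name__ == "__main__":
    data_gb = 1000  # example: copying 1 TB to or from one server
    for label, rate in [("1GbE", 1), ("2 x 1GbE bonded", 2), ("10GbE", 10)]:
        minutes = transfer_seconds(data_gb, rate) / 60
        print(f"{label:>16}: {minutes:6.1f} min")
```

Whether the tenfold line-rate difference shows up in practice depends on whether the disks or the network are the bottleneck on a given node, so these numbers are an upper bound on the benefit, not a prediction.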


