From: Gokul Soundararajan <gokul.soundar@gmail.com>
Date: Wed, 14 Jan 2015 08:20:14 -0800
Subject: Re: NFSv3 Filesystem Connector
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org

Hi Niels,

Thanks for your comments.

My goal in designing the NFS connector is *not* to replace HDFS. HDFS is ideally suited for Hadoop (otherwise, why was it built?). The problem is that we have people with PBs (10PB to 50PB) of data on NFS storage that they would like to process using Hadoop. That much data is both time-consuming and costly to move around; some have used Sqoop and Flume, but it is still painful. To help these folks, we built the connector to enable Hadoop analytics on this data. As NFS is an open standard, we believe it would benefit everyone who has this use case.

Regarding the performance point, I hope you don't think an NFS storage server is a box with several disks and a single network connection. The latest generation of storage servers are clustered storage systems that can have 17,000+ drives, hold 100PB, and support 64 10GbE ports on each cluster node. The NetApp spec sheet is here:
http://www.netapp.com/us/products/storage-systems/fas8000/fas8000-tech-specs.aspx

I hope this clarifies why we want to make this contribution: it is to unlock additional data that can be processed by Hadoop.
Thanks,

Gokul

On Wed, Jan 14, 2015 at 3:14 AM, Niels Basjes wrote:

> Hi,
>
> The main reason Hadoop scales so well is that all components try to
> adhere to the idea of data locality. In general, this means that you run
> the processing/query software on the system where the data is already
> present on the local disk.
>
> To me, this NFS solution sounds like hooking the processing nodes to a
> shared storage solution. This may work for small clusters (say, 5 nodes
> or so), but for large clusters this shared storage will be the main
> bottleneck in processing/query speed.
>
> We currently have more than 20 nodes with 12 hard disks each, resulting
> in over 50GB/sec [1] of disk-to-query-engine speed, which means our
> setup already goes much faster than any network connection to any NFS
> solution can provide. We can simply go to, say, 50 nodes and easily
> exceed 100GB/sec.
>
> So to me this sounds like hooking a scalable processing platform to a
> non-scalable storage system (mainly because the network to this storage
> doesn't scale).
>
> So far I have only seen vendors of legacy storage solutions going in
> this direction ... oh wait ... you are NetApp ... that explains it.
>
> I am not a committer on any of the Hadoop tools, but I vote against
> having such a "core concept breaking" piece in the main codebase. New
> people may start to think it is a good idea to do this.
>
> So I say you should simply make this plugin available to your customers,
> just not as a core part of Hadoop.
>
> Niels Basjes
>
> [1] 50 GB/sec = approx 20 * 12 * 200 MB/sec
> This page shows maximum read speeds in the 200MB/sec range:
> http://www.tomshardware.com/charts/enterprise-hdd-charts/-02-Read-Throughput-Maximum-h2benchw-3.16,3372.html
>
> On Tue, Jan 13, 2015 at 10:35 PM, Gokul Soundararajan <
> gokulsoundar@gmail.com> wrote:
>
> > Hi,
> >
> > We (Jingxin Feng, Xing Lin, and I) have been working on a FileSystem
> > implementation that allows Hadoop to use an NFSv3 storage server as a
> > filesystem. It leverages code from the hadoop-nfs project for all the
> > request/response handling. We would like your help to add it as part
> > of hadoop tools (similar to hadoop-aws and hadoop-azure).
> >
> > In more detail, the Hadoop NFS Connector allows Apache Hadoop (2.2+)
> > and Apache Spark (1.2+) to use an NFSv3 storage server as a storage
> > endpoint. The NFS Connector can run in two modes: (1) secondary
> > filesystem, where Hadoop/Spark runs using HDFS as its primary storage
> > and can use NFS as a second storage endpoint, and (2) primary
> > filesystem, where Hadoop/Spark runs entirely on an NFSv3 storage
> > server.
> >
> > The code is written so that existing applications do not have to
> > change. All one has to do is copy the connector jar into the lib/
> > directory of Hadoop/Spark, then modify core-site.xml to provide the
> > necessary details.
> >
> > The current version can be seen at:
> > https://github.com/NetApp/NetApp-Hadoop-NFS-Connector
> >
> > This is my first time contributing to the Hadoop codebase. It would be
> > great if someone on the Hadoop team could guide us through the
> > process. I'm willing to make the necessary changes to integrate the
> > code. What are the next steps? Should I create a JIRA entry?
> >
> > Thanks,
> >
> > Gokul
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
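[Archive note] The "modify core-site.xml" step in Gokul's message would, for a connector of this kind, look roughly like the fragment below. The property names, URI scheme, and class name here are assumptions modeled on how other Hadoop FileSystem connectors (hadoop-aws, hadoop-azure) register themselves; the connector's README on GitHub has the authoritative keys.

```xml
<!-- Hypothetical core-site.xml fragment: property names and the class name
     are illustrative, not taken from the connector's documentation. -->
<configuration>
  <!-- Map an nfs:// URI scheme to the connector's FileSystem class. -->
  <property>
    <name>fs.nfs.impl</name>
    <value>org.apache.hadoop.fs.nfs.NFSv3FileSystem</value>
  </property>
  <!-- Primary-filesystem mode: point the default filesystem at the
       NFSv3 server and its exported path (both values are placeholders). -->
  <property>
    <name>fs.defaultFS</name>
    <value>nfs://nfs-server.example.com:2049/hadoop-data</value>
  </property>
</configuration>
```

In secondary-filesystem mode, fs.defaultFS would stay on HDFS and jobs would simply reference nfs:// URIs alongside hdfs:// ones.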
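[Archive note] Niels's back-of-the-envelope number in footnote [1] can be checked directly. A quick sketch of the same aggregate-throughput arithmetic, using only the figures from his message:

```python
# Aggregate disk-to-query-engine throughput, per Niels's footnote [1].
nodes = 20
disks_per_node = 12
mb_per_sec_per_disk = 200  # max sequential read, per the Tom's Hardware chart

total_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk
total_gb_per_sec = total_mb_per_sec / 1000
print(f"{total_gb_per_sec:.0f} GB/sec")  # -> 48 GB/sec, i.e. "approx 50 GB/sec"

# Scaling to the 50 nodes he suggests:
scaled = 50 * disks_per_node * mb_per_sec_per_disk / 1000
print(f"{scaled:.0f} GB/sec")  # -> 120 GB/sec, clearing the 100 GB/sec mark
```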