From: Steve Loughran <stevel@apache.org>
To: general@hadoop.apache.org
Subject: Re: [ANN] Plasma MapReduce, PlasmaFS, version 0.4
Date: Thu, 13 Oct 2011 13:48:26 +0100
Message-ID: <4E96DE1A.4030900@apache.org>
In-Reply-To: <1318437111.16477.228.camel@thinkpad>

On 12/10/11 17:31, Gerd Stolpmann wrote:
> Hi,
>
> This is about the release of Plasma-0.4, an alternate and independent
> implementation of map/reduce with its own dfs. This might also be
> interesting for Hadoop users and developers, because this project
> incorporates a number of new ideas. So far, Plasma has proven to work
> on smaller clusters and shows good signs of being scalable. The design
> of PlasmaFS is certainly superior to that of HDFS - I did not want a
> quick'n'dirty solution, so please have a look at how to do it right.
>
> Concerning the features, these two pages compare Plasma and Hadoop:
>
> http://plasma.camlcity.org/plasma/dl/plasma-0.4/doc/html/Plasmafs_and_hdfs.html

- Without block checksums your code contains assumptions about HDD integrity that do not stand up to the classic works by Pinheiro or Schroeder. Essentially you appear to be assuming that HDDs don't corrupt data, yet both HDDs and their interconnects can play up. For a recent summary of Hadoop integrity, I would point you at [Loughran2011] http://www.slideshare.net/steve_l/did-you-reallywantthatdata (there's a rough sketch of what per-chunk checksumming looks like at the end of this mail).

- Hadoop NNs benefit from SSDs too.

- Authentication and security have improved recently, though I'd still run it in a private subnet just to be sure.

> http://plasma.camlcity.org/plasma/dl/plasma-0.4/doc/html/Plasmamr_and_hadoop.html
>
> I hope you see where the point is.

Again, support for small block sizes only really matters for small workloads. In larger clusters you will use larger block sizes anyway, and if you do work on small blocks the sheer number of task trackers reporting back to the JobTracker can overload it (some back-of-the-envelope numbers at the end of this mail).

> I have currently only limited resources for testing my implementation.
> If there is anybody interested in testing on bigger clusters, please let
> me know.

That's one of the issues with the Plasma design: I'm not sure how well things like POSIX semantics, especially locking and writes at arbitrary offsets, scale. That's why the very large filesystems, HDFS included, tend to drop them. Look at how much effort it took to get Append to work reliably. Without evidence of working at scale, I'm not sure how the claim "the design of PlasmaFS is certainly superior to that of HDFS" is defensible. Sorry.

That said, using SunOS RPC/NFS as the filesystem protocol is nice, as it does make mounting straightforward. And as locking isn't guaranteed in NFS anyway, you may be able to get away without it.
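
To make the checksum point concrete, here is a minimal sketch in Java of what per-chunk checksumming looks like. The 512-byte chunk size matches HDFS's io.bytes.per.checksum default, but the class and method names are mine, made up for illustration; this is not Hadoop's actual code.

import java.util.zip.CRC32;

/*
 * Sketch of per-chunk checksumming in a Hadoop-style DFS: every chunk of
 * a block (512 bytes by default in HDFS) gets its own CRC, stored next to
 * the block data and re-verified on every read. Names are illustrative,
 * not Hadoop's API.
 */
public class ChunkChecksums {

    static final int CHUNK_SIZE = 512;   // io.bytes.per.checksum default

    /* Compute one CRC per chunk when the block is written. */
    static long[] checksumBlock(byte[] block) {
        int chunks = (block.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32 crc = new CRC32();
            int off = i * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, block.length - off);
            crc.update(block, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /* On read, recompute and compare; a mismatch means the disk or bus lied. */
    static boolean verifyBlock(byte[] block, long[] expected) {
        long[] actual = checksumBlock(block);
        if (actual.length != expected.length) return false;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i]) return false;   // corrupt chunk
        }
        return true;
    }
}

The point is simply that every read goes back through the stored CRCs, so a flipped bit on the platter or on the wire surfaces as a checksum error instead of silently reaching the application.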
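
And to put rough numbers on the block-size point, assuming the classic one-map-task-per-block model (illustrative figures only):

  1 TB of input / 64 MB blocks =  16,384 map tasks
  1 TB of input /  4 MB blocks = 262,144 map tasks

Every one of those tasks has to be scheduled, heartbeated and reported back through the single JobTracker, so a 16x smaller block size means roughly 16x the scheduling and reporting load for the same amount of data.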