Message-ID: <376156647.1208967921506.JavaMail.jira@brutus>
Date: Wed, 23 Apr 2008 09:25:21 -0700 (PDT)
From: "Doug Cutting (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3288) Serial streaming performance should be Math.min(ideal client performance, ideal serial hdfs performance)
In-Reply-To: <94897957.1208758041505.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591673#action_12591673 ]

Doug Cutting commented on HADOOP-3288:
--------------------------------------

> wouldn't most of negatives of RAID0 that Doug mentioned apply to this as well?

The first negative (lower cluster throughput) definitely would, but that's the point: instead of always paying that penalty, as you would with RAID 0, an optional read-ahead feature would let clients declare when latency should be prioritized ahead of throughput. So mapreduce jobs probably wouldn't specify read-ahead, but 'dfs cat' might.

Whether adding an optional read-ahead feature is worth the added complexity is an open question. If it can be done simply and provides significant speedup, then it probably is, otherwise not.

> Serial streaming performance should be Math.min(ideal client performance, ideal serial hdfs performance)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3288
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3288
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.16.3, 0.18.0
>         Environment: Mac OS X 10.5.2, Java 6
>            Reporter: Sam Pullara
>             Fix For: 0.18.0
>
>
> I looked at all the code long and hard and this was my analysis (could be wrong, I'm not an expert on this codebase):
> Current Serial HDFS performance = Average Datanode Performance
> Average Datanode Performance = Average Disk Performance (even if you have more than one)
> We should have:
> Ideal Serial HDFS Performance = Sum of Ideal Datanode Performance
> Ideal Datanode Performance = Sum of disk performance
> When you read a single file serially from HDFS there are a number of limitations that come into play:
> 1) Blocks on multiple datanodes will be load balanced between them - averaging the performance of the datanodes
> 2) Blocks on multiple disks in a single datanode are load balanced between them - averaging the performance of the disks
> I think that all this could be fixed if we actually prefetched fully read blocks on the client until the client can no longer keep up with the data or there is another bottleneck like network bandwidth.
> This seems like a reasonably common use case though not the typical MapReduce case.
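
The throughput model in the quoted issue description can be written out as a small calculation. This is only a sketch of the reporter's arithmetic, not HDFS code: the function names, the `prefetch` flag, and all MB/s figures are hypothetical and chosen for illustration. It also ignores other bottlenecks such as network bandwidth, which the description notes would cap the ideal case.

```python
# Illustrative model of the issue's analysis; not part of Hadoop.
# All throughput figures (MB/s) are hypothetical, not measurements.

def datanode_throughput(disk_mb_s, prefetch):
    """Rate of one datanode: average of its disks today, sum of them in the ideal case."""
    if prefetch:
        return sum(disk_mb_s)
    return sum(disk_mb_s) / len(disk_mb_s)


def serial_hdfs_throughput(datanodes, prefetch):
    """Rate of one serial read: average of the datanodes today, sum of them in the ideal case."""
    rates = [datanode_throughput(disks, prefetch) for disks in datanodes]
    if prefetch:
        return sum(rates)
    return sum(rates) / len(rates)


def serial_streaming_throughput(client_mb_s, datanodes, prefetch):
    """The issue title: min(ideal client performance, ideal serial HDFS performance)."""
    return min(client_mb_s, serial_hdfs_throughput(datanodes, prefetch))


# Three hypothetical datanodes with two disks each.
cluster = [[60.0, 60.0], [50.0, 70.0], [40.0, 40.0]]

current = serial_streaming_throughput(500.0, cluster, prefetch=False)
ideal = serial_streaming_throughput(500.0, cluster, prefetch=True)
capped = serial_streaming_throughput(200.0, cluster, prefetch=True)

print(current, ideal, capped)
```

With these made-up numbers the current behaviour comes out at about 53.3 MB/s, the ideal case at 320 MB/s, and a client that can only consume 200 MB/s is capped at 200 MB/s. The gap between the first two is the cluster throughput that a client asking for read-ahead would take from other readers, which is the trade-off discussed in the comment.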