From: "eric baldeschwieler (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-38) default splitter should incorporate fs block size
Date: Wed, 15 Feb 2006 02:31:15 +0100 (CET)
Message-ID: <1399208558.1139967075983.JavaMail.jira@ajax.apache.org>
In-Reply-To: <1089400937.1139945203207.JavaMail.jira@ajax.apache.org>

    [ http://issues.apache.org/jira/browse/HADOOP-38?page=comments#action_12366421 ]

eric baldeschwieler commented on HADOOP-38:
-------------------------------------------

Some thoughts:

1) Eventually you are going to want RAID / erasure-coding-style redundancy. The simplest way to do this without breaking reads is to batch several blocks, keeping them linear, and then generate parity once all the blocks are full. This gets more expensive as block size increases. At current sizes this can all be buffered in RAM in some cases; 1 GB blocks rule that out.

2) Currently you can trivially keep a block in RAM for a map task. Depending on the scaling factor, you can probably keep the output in RAM for sorting, reduction, etc. too. This is nice. As block size increases you lose this property.

3) When you lose a node, the finer-grained the lost data, the fewer hotspots you have in the system. Today in a large cluster you can easily have choke points with ~33 Mbit aggregate all-to-all, and we've already seen larger data sizes slow recovery to the point of being a real problem. 1 GB blocks take 10x as long to transmit, and that turns into minutes, which will require more sophisticated management (rough numbers sketched below).

---

None of these are show-stoppers, but one of the main reasons we are interested in Hadoop is getting off of our current very-large-chunk storage system, so I'd hate to see the default move quickly to something as large as 1 GB. I can see the advantage of pushing the block size up to manage task tracker RAM, but I doubt that alone will prove a compelling reason for us to change our default block size. On the other hand, I also don't think we'll be pumping a petabyte through a single m/r in the near term, so we can stay with the zero-code solution (change the block size) until we have more data to support some other approach.
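To put numbers on point 3, here is a back-of-envelope sketch in Java. The ~33 Mbit/s choke-point figure is from the comment above; the 100 MB "current" block size is an assumption picked only to match the 10x comparison, not a measured value.

    // Rough recovery-transfer arithmetic. Assumes a ~33 Mbit/s (~4 MB/s)
    // per-path choke point, per the comment above; the 100 MB "current"
    // block size is hypothetical, chosen to match the 10x comparison.
    public class BlockTransferSketch {
        public static void main(String[] args) {
            double pathMBps = 33.0 / 8.0;       // ~33 Mbit/s ~= 4.1 MB/s
            long[] blockSizesMB = {100, 1024};  // assumed current size vs. 1 GB
            for (long mb : blockSizesMB) {
                System.out.printf("%4d MB block: ~%.0f s per replica copy%n",
                                  mb, mb / pathMBps);
            }
            // Prints ~24 s for 100 MB vs. ~248 s (over four minutes) for 1 GB,
            // which is where "this turns into minutes" comes from.
        }
    }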
Of course at 1M tasks you will want to be careful about linear scans anyway... I have no concern with the proposal in this bug; we can probably take this discussion elsewhere.

> default splitter should incorporate fs block size
> -------------------------------------------------
>
>          Key: HADOOP-38
>          URL: http://issues.apache.org/jira/browse/HADOOP-38
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Reporter: Doug Cutting
>
> By default, the file splitting code should operate as follows.
>
> inputs are <file>*, numMapTasks, minSplitSize, fsBlockSize
> output is <split>*
>
> totalSize = sum of all file sizes;
> desiredSplitSize = totalSize / numMapTasks;
> if (desiredSplitSize > fsBlockSize)    /* new */
>   desiredSplitSize = fsBlockSize;
> if (desiredSplitSize < minSplitSize)
>   desiredSplitSize = minSplitSize;
> chop input files into desiredSplitSize chunks & return them
>
> In other words, numMapTasks is a desired minimum. We'll try to chop the input into at least numMapTasks chunks, each ideally a single fs block.
>
> If there's not enough input data to create numMapTasks tasks, each with an entire block, then we'll permit tasks whose input is smaller than a filesystem block, down to a minimum split size.
>
> This handles cases where:
> - each input record takes a lot of time to process. In this case we want to make sure we use all of the cluster, so it is important to permit splits smaller than the fs block size.
> - input i/o dominates. In this case we want to permit the placement of tasks on hosts where their data is local. This is only possible if splits are fs block size or smaller.
>
> Are there other common cases that this algorithm does not handle well?
>
> The part marked 'new' above is not currently implemented, but I'd like to add it.
>
> Does this sound reasonable?
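For reference, here is the splitting rule quoted above as a minimal Java sketch. The class, method name, and signature are illustrative only, not the actual Hadoop API.

    // Sketch of the proposed default split-size rule (illustrative names,
    // not the real Hadoop interfaces).
    public class SplitSizeSketch {
        static long desiredSplitSize(long totalSize, int numMapTasks,
                                     long minSplitSize, long fsBlockSize) {
            long desired = totalSize / numMapTasks; // aim for numMapTasks splits
            if (desired > fsBlockSize)      // the part marked 'new':
                desired = fsBlockSize;      //   cap at one fs block so tasks stay local
            if (desired < minSplitSize)     // but never drop below the floor
                desired = minSplitSize;
            return desired;
        }

        public static void main(String[] args) {
            // Example: 10 GB of input, 1000 map tasks, a 1 MB floor, 32 MB blocks.
            // 10 GB / 1000 tasks = ~10 MB per split: under the block cap, over the floor.
            System.out.println(
                desiredSplitSize(10L << 30, 1000, 1L << 20, 32L << 20));
        }
    }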