Message-ID: <5713995.1160794191312.JavaMail.jira@brutus>
Date: Fri, 13 Oct 2006 19:49:51 -0700 (PDT)
From: "p sutter (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-141) Disk thrashing / task timeouts during map output copy phase
In-Reply-To: <9808210.1145312482052.JavaMail.jira@brutus>

    [ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12442193 ]

p sutter commented on HADOOP-141:
---------------------------------

[[ Old comment, sent by email on
Wed, 2 Aug 2006 13:47:05 -0700 ]]

Close it out! The new shuffle path is really great.

> Disk thrashing / task timeouts during map output copy phase
> -----------------------------------------------------------
>
>                 Key: HADOOP-141
>                 URL: http://issues.apache.org/jira/browse/HADOOP-141
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>         Environment: linux
>            Reporter: p sutter
>
> MapOutputProtocol connections cause timeouts because of system thrashing and transferring the same file over and over again, ultimately leading to no forward progress (medium-sized job, 500GB input file, map output about as large as the input, 10-node cluster).
> There are several bugs behind this, but the following two changes improved matters considerably.
> (1)
> The buffer size in MapOutputFile is currently hardcoded to 8192 bytes (for both reads and writes). Changing this buffer size to 256KB reduced the number of disk seeks, and the problem went away.
> Ideally there would be a buffer size parameter for this that is separate from the DFS io buffer size.
> (2)
> I also added the following code to the socket configuration in both Server.java and Client.java. No linger is a minor good idea in an environment with some packet loss (and you will have that when all the nodes get busy at once). 256KB buffers are probably excessive, especially on a LAN, but it takes me two hours to test changes, so I haven't experimented.
>     socket.setSendBufferSize(256*1024);
>     socket.setReceiveBufferSize(256*1024);
>     socket.setSoLinger(false, 0);
>     socket.setKeepAlive(true);
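
For point (1) in the quoted description, here is a minimal sketch of a stream copy whose buffer size is a parameter rather than a hardcoded constant. The class and method names (BufferedCopySketch, copy) are hypothetical and chosen only for illustration; this is not the MapOutputFile code, just one way the "separate buffer size parameter" idea could look.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration; not part of Hadoop. */
public class BufferedCopySketch {
    /** 8192 bytes: the hardcoded size described in the issue. */
    public static final int OLD_BUFFER_SIZE = 8192;
    /** 256KB: the size the reporter changed it to. */
    public static final int NEW_BUFFER_SIZE = 256 * 1024;

    /**
     * Copies everything from in to out using a buffer of bufferSize bytes.
     * Returns the number of read() calls that delivered data, which gives a
     * rough idea of how many separate I/O requests the copy issued.
     */
    public static long copy(InputStream in, OutputStream out, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        long reads = 0;
        try {
            int n;
            // read() returns -1 at end of stream; otherwise n bytes were filled.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                reads++;
            }
        } catch (IOException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new UncheckedIOException(e);
        }
        return reads;
    }
}
```

With a 1MB in-memory input, the 8192-byte buffer takes 128 read calls and the 256KB buffer takes 4. On a busy disk the actual seek behavior depends on the OS and file layout, but fewer, larger requests is the effect the description attributes the improvement to.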