From: "Yuri Pradkin (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4614) "Too many open files" error while processing a large gzip file
Date: Wed, 19 Nov 2008 09:15:44 -0800 (PST)
Message-ID: <325004490.1227114944347.JavaMail.jira@brutus>
In-Reply-To: <1796722425.1226048204926.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649100#action_12649100 ]

Yuri Pradkin commented on HADOOP-4614:
--------------------------------------

bq. I doubt whether there exists a testcase that would spill more than once in the map task (note that the code in question would be exercised only if the number of first level spills is greater than 1).

If I understand what you're saying, numSpills is most commonly 1 in that code, and the code in question will not run at all because of a prior check. How would a test case force multiple spills? Can it, for instance, set *io.sort.mb*, *io.sort.spill.percent*, and *io.sort.record.percent* to something really small? Will that alone do the trick?
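As a minimal sketch of that idea only (not a patch from this issue; the class name and the specific values are assumptions and would need tuning against the test's input size), a test might shrink the map-side sort buffer and its spill thresholds like this:

    // Hypothetical test configuration that tries to force multiple first-level
    // spills by making the map-side sort buffer and spill thresholds very small.
    import org.apache.hadoop.mapred.JobConf;

    public class SmallSpillConf {
      public static JobConf create() {
        JobConf conf = new JobConf();
        conf.setInt("io.sort.mb", 1);                // 1 MB in-memory sort buffer
        conf.set("io.sort.spill.percent", "0.05");   // start spilling when 5% full
        conf.set("io.sort.record.percent", "0.05");  // small share reserved for record accounting
        return conf;
      }
    }

Whether shrinking these three properties alone is enough to produce more than one spill for a given test input is exactly the open question in the comment above.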
> "Too many open files" error while processing a large gzip file
> --------------------------------------------------------------
>
>                 Key: HADOOP-4614
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4614
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.2
>            Reporter: Abdul Qadeer
>            Assignee: Yuri Pradkin
>             Fix For: 0.18.3
>
>         Attachments: HADOOP-4614.patch, HADOOP-4614.patch, openfds.txt
>
>
> I am running a simple word count program on gzip-compressed data of size 4 GB (the uncompressed size is about 7 GB). I have a setup of 17 nodes in my Hadoop cluster. After some time, I get the following exception:
>
> java.io.FileNotFoundException: /usr/local/hadoop/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_200811041109_0003/attempt_200811041109_0003_m_000000_0/output/spill4055.out.index (Too many open files)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.<init>(FileInputStream.java:137)
>         at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:62)
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:98)
>         at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:168)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
>         at org.apache.hadoop.mapred.IndexRecord.readIndexFile(IndexRecord.java:47)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getIndexInformation(MapTask.java:1339)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1237)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
>         at org.apache.hadoop.mapred.Child.main(Child.java:155)
>
> From a user's perspective, I know that Hadoop will use only one mapper for a gzipped file. The above exception suggests that Hadoop puts the intermediate data into many files. But the question is: exactly how many open files are required in the worst case, for any data size and cluster size? Currently it looks as if Hadoop needs more open files as the input size or the cluster size (in terms of nodes, mappers, and reducers) increases, which is a problem as far as scalability is concerned. A user has to put a number into /etc/security/limits.conf saying how many open files a Hadoop node is allowed, and the question is what that "magical number" should be.
> So probably the best solution to this problem is to change Hadoop in such a way that it can work with some moderate number of allowed open files (e.g. 4 K), or to suggest some other number as an upper limit, so that a user can be sure that for any data size and cluster size Hadoop will not run into this "too many open files" issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
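As a purely illustrative sketch of the /etc/security/limits.conf setting the reporter mentions (the "hadoop" user name and the value 16384 are assumptions, not the worst-case bound asked about above), the per-user open-file limit is typically raised with a pair of lines like:

    # /etc/security/limits.conf -- illustrative values only
    hadoop  soft  nofile  16384
    hadoop  hard  nofile  16384

Raising the limit this way only works around the symptom; it does not answer the reporter's question of what upper bound Hadoop itself should guarantee.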