From: "Lili Wu"
To: core-user@hadoop.apache.org
Cc: samr@ning.com
Date: Wed, 30 Apr 2008 13:39:27 -0700
Subject: OOM error with large # of map tasks
Message-ID: <7b2728090804301339q46421e6eobd9f1d040d1a0fcc@mail.gmail.com>

We are using Hadoop 0.16 and are seeing a consistent problem: out-of-memory errors when we have a large number of map tasks.

To reproduce this, we submit three large jobs:

1. 20,000 map tasks and 10 reduce tasks
2. 17,000 map tasks and 10 reduce tasks
3. 10,000 map tasks and 10 reduce tasks

These run at normal priority, and periodically we swap the priorities around so that each job gets some tasks started and can let them complete. Other, smaller jobs come and go every hour or so (no more than 200 map tasks and 4-10 reducers each).

Our cluster consists of 23 nodes, and it can run 69 map tasks and 69 reduce tasks at a time.

Eventually we see consistent OOM errors in the task logs, and the TaskTracker itself goes down on as many as 14 of our nodes.

We examined a heap dump taken after one of these TaskTracker crashes and found something interesting: there were 572 JobConf instances, which accounted for 940 MB of String objects. It seems quite odd that there are so many JobConf instances. The count seems to correlate with tasks in the COMMIT_PENDING state, as shown on the status for a TaskTracker node.

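For a rough sense of scale, the figures above can be worked through as below. This is only a back-of-the-envelope sketch: the class name `JobConfFootprint` and its constants are made up for illustration, and it assumes the 940 MB of String objects is spread evenly across the 572 JobConf instances, which the heap dump summary does not confirm.

```java
public class JobConfFootprint {
    // Figures quoted in this message (heap dump and job sizes).
    static final int JOBCONF_INSTANCES = 572;
    static final double STRING_HEAP_MB = 940.0;
    static final int[] MAP_TASKS_PER_JOB = {20_000, 17_000, 10_000};

    /** Average String heap attributed to one JobConf, in MB (assumes an even spread). */
    static double mbPerJobConf() {
        return STRING_HEAP_MB / JOBCONF_INSTANCES;
    }

    /** Total map tasks across the three large jobs. */
    static int totalMapTasks() {
        int total = 0;
        for (int n : MAP_TASKS_PER_JOB) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("average per JobConf: %.2f MB%n", mbPerJobConf());
        System.out.println("total map tasks in the large jobs: " + totalMapTasks());
    }
}
```

Under that assumption, each retained JobConf holds well over a megabyte of Strings, so even a few hundred of them held in memory at once would be a large share of a TaskTracker heap.
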
Has anyone observed something like this? Can anyone explain what would cause tasks to remain in this state (which apparently is also held in memory rather than serialized to disk)? In general, what does COMMIT_PENDING mean? (Job done, but output not committed to DFS?)

Thanks!