From: Dennis Kubes
Date: Sun, 11 Jun 2006 10:07:05 -0500
To: hadoop-user@lucene.apache.org
Subject: Out of Memory during Sorts
Message-ID: <448C3199.9040908@dragonflymc.com>

Can someone point me in the right direction on configuring settings for large sorting operations (more than 1M rows)?
I keep getting out of memory exceptions during the sort phase. Here are my current settings. I have 2G heap space on each box.

Dennis

io.sort.factor = 20
    The number of streams to merge at once while sorting files. This determines the number of open file handles.

io.sort.mb = 200
    The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.

io.file.buffer.size = 8192
    The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.

io.bytes.per.checksum = 4096
    The number of bytes per checksum. Must not be larger than io.file.buffer.size.
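
As a rough sanity check on the settings above, the following Python sketch works through the arithmetic the descriptions imply. It is only an illustration: the helper name check_settings is hypothetical, the even split of io.sort.mb across merge streams is an assumption based on the "1MB per merge stream" wording, and treating the 2G figure as the heap available to the sorting process is also an assumption.

```python
# Values copied from the settings listed in the message.
SETTINGS = {
    "io.sort.factor": 20,
    "io.sort.mb": 200,
    "io.file.buffer.size": 8192,
    "io.bytes.per.checksum": 4096,
}
HEAP_MB = 2048    # "2G heap space on each box" (assumed to be the sorter's heap)
PAGE_SIZE = 4096  # hardware page size quoted in the io.file.buffer.size description


def check_settings(settings, heap_mb=HEAP_MB, page_size=PAGE_SIZE):
    """Hypothetical helper: derive figures and check the stated constraints."""
    sort_mb = settings["io.sort.mb"]
    factor = settings["io.sort.factor"]
    buffer_size = settings["io.file.buffer.size"]
    bytes_per_checksum = settings["io.bytes.per.checksum"]
    return {
        # Assumes the sort buffer is divided evenly among merge streams.
        "mb_per_merge_stream": sort_mb / factor,
        # Share of the heap taken by the sort buffer alone.
        "sort_buffer_fraction_of_heap": sort_mb / heap_mb,
        # "should probably be a multiple of hardware page size"
        "buffer_is_page_multiple": buffer_size % page_size == 0,
        # "Must not be larger than io.file.buffer.size"
        "checksum_fits_buffer": bytes_per_checksum <= buffer_size,
    }


if __name__ == "__main__":
    for name, value in check_settings(SETTINGS).items():
        print(f"{name}: {value}")
```

For the posted values this gives 10 MB per merge stream and a sort buffer of just under 10% of a 2G heap, and both stated constraints hold, so the sketch by itself does not point to a cause for the exceptions.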