From: Stefan Will
Date: Thu, 19 Mar 2009 11:26:37 -0700
Subject: Re: intermediate results not getting compressed
Reply-To: core-user@hadoop.apache.org
Mailing-List: core-user@hadoop.apache.org

I noticed this too. I think the compression only applies to the final mapper
and reducer outputs, but not to any intermediate files produced. The reducer
decompresses the map output files after copying them, and compresses its own
output only after it has finished. I wonder if this is by design or just an
oversight.

-- Stefan

> From: Billy Pearson
> Reply-To:
> Date: Wed, 18 Mar 2009 22:14:07 -0500
> To:
> Subject: Re: intermediate results not getting compressed
>
> I can run head on the map.out files and I get compressed garbage, but when I
> run head on an intermediate file I can read the data in the file clearly, so
> compression is not getting passed through, even though I am setting
> CompressMapOutput to true by default in my hadoop-site.xml file.
>
> Billy
>
>
> "Billy Pearson"
> wrote in message news:gpscu3$66p$1@ger.gmane.org...
>> The intermediate.X files are not getting compressed for some reason; I am
>> not sure why.
>> I downloaded and built the latest branch for 0.19.
>>
>> o.a.h.mapred.Merger.class line 432
>> new Writer(conf, fs, outputFile, keyClass, valueClass, codec);
>>
>> This seems to use the codec defined above, but for some reason it's not
>> working correctly: the compression is not passing from the map output files
>> to the on-disk merge of the intermediate.X files.
>>
>> tail task report from one server:
>>
>> 2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 1730 files left.
>> 2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 3 files left.
>> 2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 3 segments, 39835369 bytes in memory for intermediate, on-disk merge
>> 2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1730 files, 70359998581 bytes from disk
>> 2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
>> 2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733 sorted segments
>> 2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22 intermediate segments out of a total of 1733
>> 2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1712
>> 2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1683
>> 2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1654
>> 2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1625
>> 2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1596
>> 2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1567
>> 2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1538
>> 2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1509
>> 2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1480
>> 2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1451
>> 2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1422
>> 2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1393
>> 2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1364
>> 2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1335
>> 2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1306
>>
>> See, the size of the files is about 70 GB (70359998581 bytes), and these are
>> compressed. At this point it has gone from 1733 files to 1306 left to merge,
>> the intermediate.X files are well over 200 GB, and we are not even close to
>> done. If compression were working we should not see tasks failing at this
>> point for lack of hard drive space, since as we merge we delete the merged
>> files from the output folder.
>>
>> I only see this happening when there are too many files left that did not
>> get merged during the shuffle stage and the on-disk merging starts.
>> The tasks that complete the merges and keep the file count below the
>> io.sort.factor limit (30 in my case) skip the on-disk merge and complete
>> using a normal amount of hard drive space.
>>
>> Anyone care to take a look?
>> This job takes two or more days to get to this point, so it is getting to be
>> a pain in the butt to run and watch the reduces fail and the job keep
>> failing no matter what.
>>
>> I can post the tail of this task log when it fails to show you how far it
>> gets before it runs out of space. Before the reduce on-disk merge starts, the
>> disks are about 35-40% used on 500 GB drives, with two tasks running at the
>> same time.
>>
>> Billy Pearson
>>
>>
>
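
The segment counts in the quoted log (22 segments merged out of 1733, then 30
at a time out of 1712, 1683, 1654, ...) can be reproduced with a little
arithmetic. The sketch below is not the Hadoop code. It assumes a merge factor
of 30, assumes each pass replaces k segments with one, and assumes the first
pass is sized so that every later pass merges exactly 30. That first-pass rule
is inferred from the log, and the class and method names are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSchedule {

    // Size of the first merge pass, picked so that all later passes can merge
    // exactly `factor` segments. Assumption: this rule is inferred from the
    // log output, it is not copied from o.a.h.mapred.Merger.
    public static int firstPassSize(int segments, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        if (segments <= factor) {
            return factor;
        }
        int mod = (segments - 1) % (factor - 1);
        return mod == 0 ? factor : mod + 1;
    }

    // Number of segments present at the start of each intermediate pass.
    // The loop stops once the remaining segments fit into one final merge.
    public static List<Integer> remainingBeforeEachPass(int segments, int factor) {
        List<Integer> remaining = new ArrayList<>();
        int passSize = firstPassSize(segments, factor);
        while (segments > factor) {
            remaining.add(segments);
            segments -= passSize - 1; // passSize segments become one segment
            passSize = factor;        // every later pass uses the full factor
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<Integer> totals = remainingBeforeEachPass(1733, 30);
        System.out.println("first pass merges: " + firstPassSize(1733, 30));
        System.out.println("first totals: " + totals.subList(0, 4));
        System.out.println("intermediate passes: " + totals.size());
    }
}
```

Under these assumptions the first pass merges 22 segments and the totals start
1733, 1712, 1683, 1654, as in the log. The same arithmetic gives 59
intermediate passes before the final merge, of which the log above shows only
the first 16, which fits the observation that the task is not even close to
done.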