Subject: Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
From: Kim Chew <kchew534@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 28 Mar 2014 11:31:50 -0700

None of that. I checked the input file's SequenceFile header and it says
"org.apache.hadoop.io.compress.zlib.BuiltInZlibDeflater".

Kim

On Fri, Mar 28, 2014 at 10:34 AM, Hardik Pandya wrote:

> What is your compression format: gzip, lzo or snappy?
>
> For lzo, the final output is configured with:
>
> FileOutputFormat.setCompressOutput(conf, true);
> FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);
>
> In addition, to make LZO splittable, you need to build an LZO index file.
>
> On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew wrote:
>
>> Thanks folks.
>>
>> I was not aware that my input data file had been compressed.
>> FileOutputFormat.setCompressOutput() is set to true when the file is
>> written. 8-(
>>
>> Kim
>>
>> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead wrote:
>>
>>> The following might answer you partially:
>>>
>>> The input key is not read from HDFS; it is auto-generated as the
>>> offset of the input value in the input file. I think that is
>>> (partially) why HDFS bytes read is smaller than HDFS bytes written.
>>>
>>> On Mar 27, 2014 1:34 PM, "Kim Chew" wrote:
>>>
>>>> I am also wondering: if, say, I have two identical timestamps, they
>>>> are going to be written to the same file. Does MultipleOutputs
>>>> handle appending?
>>>>
>>>> Thanks.
>>>>
>>>> Kim
>>>>
>>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen wrote:
>>>>
>>>>> Have you checked the content of the files you write?
>>>>>
>>>>> /th
>>>>>
>>>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>>>> > I have a simple M/R job using a Mapper only, thus no reducer. The
>>>>> > mapper reads a timestamp from the value, generates a path to the
>>>>> > output file, and writes the key and value to that file.
>>>>> >
>>>>> > The input file is a sequence file, not compressed, stored in
>>>>> > HDFS; it has a size of 162.68 MB.
>>>>> >
>>>>> > The output is also written as a sequence file.
>>>>> >
>>>>> > However, after I ran my job, I have two output part files from
>>>>> > the mapper. One has a size of 835.12 MB and the other 224.77 MB.
>>>>> > So why is the total output size so much larger? Shouldn't it be
>>>>> > more or less equal to the input's size of 162.68 MB, since I just
>>>>> > write the key and value passed to the mapper to the output?
>>>>> >
>>>>> > Here is the mapper code snippet:
>>>>> >
>>>>> > public void map(BytesWritable key, BytesWritable value, Context context)
>>>>> >         throws IOException, InterruptedException {
>>>>> >     long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>>>>> >     String tsStr = sdf.format(new Date(timestamp * 1000L));
>>>>> >     // mos is a MultipleOutputs object.
>>>>> >     mos.write(key, value, generateFileName(tsStr));
>>>>> > }
>>>>> >
>>>>> > private String generateFileName(String key) {
>>>>> >     return outputDir + "/" + key + "/raw-vectors";
>>>>> > }
>>>>> >
>>>>> > And here are the job counters:
>>>>> >
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Bytes Written=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   HDFS_BYTES_READ=171086386
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FILE_BYTES_WRITTEN=54272
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   HDFS_BYTES_WRITTEN=1111374798
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Bytes Read=170782415
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map input records=547
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Physical memory (bytes) snapshot=166428672
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Spilled Records=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Total committed heap usage (bytes)=38351872
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   CPU time spent (ms)=20080
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Virtual memory (bytes) snapshot=1240104960
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   SPLIT_RAW_BYTES=286
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map output records=0
>>>>> >
>>>>> > TIA,
>>>>> >
>>>>> > Kim
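[Editor's note] The counters above are consistent with the resolution of the thread: a DEFLATE-compressed input was rewritten uncompressed, and 171,086,386 bytes read versus 1,111,374,798 bytes written is a ratio of roughly 6.5:1, plausible for zlib on repetitive record data. A minimal standalone sketch of that effect, using plain `java.util.zip` rather than Hadoop's `BuiltInZlibDeflater` (which wraps the same zlib algorithm); the record contents below are made up for illustration:

```java
import java.util.zip.Deflater;

public class DeflateRatioDemo {

    // Compress with DEFLATE (the algorithm behind zlib's BuiltInZlibDeflater)
    // and return the compressed size in bytes.
    static int compressedSize(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64]; // ample for compressible input
        int n = deflater.deflate(buf);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        // ~880 KB of repetitive, made-up "record" data; sequence-file payloads
        // of timestamped vectors often compress comparably well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("timestamp=1396031510,vector=0.0,0.0,0.0,0.0;");
        }
        byte[] raw = sb.toString().getBytes();
        int compressed = compressedSize(raw);

        // Reading the compressed bytes but writing the raw bytes back out is
        // exactly the HDFS_BYTES_WRITTEN >> HDFS_BYTES_READ pattern above.
        System.out.printf("raw=%d compressed=%d ratio=%.1f%n",
                raw.length, compressed, (double) raw.length / compressed);
    }
}
```

The exact ratio depends on the data; the point is only that uncompressing on read and writing raw multiplies the byte count by the compression ratio.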
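[Editor's note] For the remedy Hardik sketches, the matching setup for block-compressed SequenceFile output under the newer `org.apache.hadoop.mapreduce` API would look roughly like the fragment below. This is an untested configuration sketch: `job` is assumed to be the `Job` being configured, and `DefaultCodec` (Hadoop's zlib/DEFLATE codec) stands in for whichever codec is wanted; the thread's own snippet uses the older `JobConf`-style calls.

```java
// Configuration sketch (assumes a new-API Job named `job`). With output
// compression enabled, HDFS_BYTES_WRITTEN should stay near the compressed
// input size instead of ballooning to the raw size.
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
```

The `mos.write(key, value, baseOutputPath)` variant of MultipleOutputs takes its format from the job configuration, so these settings should also apply to the per-timestamp files.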