Subject: Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
From: Kim Chew <kchew534@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 28 Mar 2014 11:31:50 -0700

None of that. I checked the input file's SequenceFile header and it says
"org.apache.hadoop.io.compress.zlib.BuiltInZlibDeflater".

Kim

On Fri, Mar 28, 2014 at 10:34 AM, Hardik Pandya wrote:

> What is your compression format: gzip, lzo or snappy?
>
> For lzo, the final output is configured with:
>
> FileOutputFormat.setCompressOutput(conf, true);
> FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);
>
> In addition, to make LZO splittable, you need to build an LZO index file.
>
> On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew wrote:
>
>> Thanks folks.
>>
>> I was not aware that my input data file had been compressed.
>> FileOutputFormat.setCompressOutput() is set to true when the file is
>> written. 8-(
>>
>> Kim
>>
>> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead wrote:
>>
>>> The following might answer you partially:
>>>
>>> The input key is not read from HDFS; it is auto-generated as the
>>> offset of the input value in the input file. I think that is
>>> (partially) why HDFS bytes read is smaller than HDFS bytes written.
>>>
>>> On Mar 27, 2014 1:34 PM, "Kim Chew" wrote:
>>>
>>>> I am also wondering: if, say, I have two identical timestamps, they
>>>> are going to be written to the same file. Does MultipleOutputs
>>>> handle appending?
>>>>
>>>> Thanks.
>>>>
>>>> Kim
>>>>
>>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen wrote:
>>>>
>>>>> Have you checked the content of the files you write?
>>>>>
>>>>> /th
>>>>>
>>>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>>>> > I have a simple M/R job using a Mapper only, thus no reducer. The
>>>>> > mapper reads a timestamp from the value, generates a path to the
>>>>> > output file, and writes the key and value to that file.
>>>>> >
>>>>> > The input file is a sequence file, not compressed, stored in
>>>>> > HDFS; it has a size of 162.68 MB.
>>>>> >
>>>>> > The output is also written as a sequence file.
>>>>> >
>>>>> > However, after I ran my job, I have two output part files from
>>>>> > the mapper. One has a size of 835.12 MB and the other 224.77 MB.
>>>>> > So why is the total output size so much larger? Shouldn't it be
>>>>> > more or less equal to the input's size of 162.68 MB, since I just
>>>>> > write the key and value passed to the mapper to the output?
>>>>> >
>>>>> > Here is the mapper code snippet:
>>>>> >
>>>>> > public void map(BytesWritable key, BytesWritable value, Context context)
>>>>> >         throws IOException, InterruptedException {
>>>>> >     long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>>>>> >     String tsStr = sdf.format(new Date(timestamp * 1000L));
>>>>> >     // mos is a MultipleOutputs object.
>>>>> >     mos.write(key, value, generateFileName(tsStr));
>>>>> > }
>>>>> >
>>>>> > private String generateFileName(String key) {
>>>>> >     return outputDir + "/" + key + "/raw-vectors";
>>>>> > }
>>>>> >
>>>>> > And here are the job counters:
>>>>> >
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Bytes Written=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   HDFS_BYTES_READ=171086386
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FILE_BYTES_WRITTEN=54272
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   HDFS_BYTES_WRITTEN=1111374798
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Bytes Read=170782415
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map input records=547
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Physical memory (bytes) snapshot=166428672
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Spilled Records=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Total committed heap usage (bytes)=38351872
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   CPU time spent (ms)=20080
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Virtual memory (bytes) snapshot=1240104960
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   SPLIT_RAW_BYTES=286
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map output records=0
>>>>> >
>>>>> > TIA,
>>>>> >
>>>>> > Kim
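[Editor's note] The counters above are consistent with the resolution of the thread: a DEFLATE-compressed input was rewritten uncompressed, and 171,086,386 bytes read versus 1,111,374,798 bytes written is a ratio of roughly 6.5:1, plausible for zlib on repetitive record data. A minimal standalone sketch of that effect, using plain `java.util.zip` rather than Hadoop's `BuiltInZlibDeflater` (which wraps the same zlib algorithm); the record contents below are made up for illustration:

```java
import java.util.zip.Deflater;

public class DeflateRatioDemo {

    // Compress with DEFLATE (the algorithm behind zlib's BuiltInZlibDeflater)
    // and return the compressed size in bytes.
    static int compressedSize(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64]; // ample for compressible input
        int n = deflater.deflate(buf);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        // ~880 KB of repetitive, made-up "record" data; sequence-file payloads
        // of timestamped vectors often compress comparably well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("timestamp=1396031510,vector=0.0,0.0,0.0,0.0;");
        }
        byte[] raw = sb.toString().getBytes();
        int compressed = compressedSize(raw);

        // Reading the compressed bytes but writing the raw bytes back out is
        // exactly the HDFS_BYTES_WRITTEN >> HDFS_BYTES_READ pattern above.
        System.out.printf("raw=%d compressed=%d ratio=%.1f%n",
                raw.length, compressed, (double) raw.length / compressed);
    }
}
```

The exact ratio depends on the data; the point is only that uncompressing on read and writing raw multiplies the byte count by the compression ratio.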
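[Editor's note] For the remedy Hardik sketches, the matching setup for block-compressed SequenceFile output under the newer `org.apache.hadoop.mapreduce` API would look roughly like the fragment below. This is an untested configuration sketch: `job` is assumed to be the `Job` being configured, and `DefaultCodec` (Hadoop's zlib/DEFLATE codec) stands in for whichever codec is wanted; the thread's own snippet uses the older `JobConf`-style calls.

```java
// Configuration sketch (assumes a new-API Job named `job`). With output
// compression enabled, HDFS_BYTES_WRITTEN should stay near the compressed
// input size instead of ballooning to the raw size.
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
```

The `mos.write(key, value, baseOutputPath)` variant of MultipleOutputs takes its format from the job configuration, so these settings should also apply to the per-timestamp files.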