hadoop-common-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Problem found while using LZO compression in Hadoop 0.20.1
Date Wed, 09 Jun 2010 13:33:07 GMT
Hi,

Where did you get the LZO libraries? The ones on Google Code are broken,
please use the ones on github:

http://github.com/toddlipcon/hadoop-lzo

Thanks
-Todd


On Wed, Jun 9, 2010 at 2:59 AM, 李钰 <carp84@gmail.com> wrote:

> Hi,
>
> While using LZO compression to try to improve performance of my cluster, I
> found that compression didn't work. The job I run is
> "org.apache.hadoop.examples.Sort", with the input data generated by
> "org.apache.hadoop.examples.RandomWriter".
> I've made sure that I configured lzo native library/jar files right and set
> all compression related parameters (such as "mapred.compress.map.output",
> "mapred.output.compression.type", "mapred.output.compression.codec",
> "mapred.output.compress" and "map.output.compression.codec"). According to
> information in the job logs, the tasktracker did compress the map/job
> output. But the output file is not compressed at all!
> Then I searched the internet, and found from
> http://wiki.apache.org/hadoop/SequenceFile that in the *SequenceFile Common
> Header*, there are two bytes that decide whether compression and block
> compression are turned on for the file. I checked the sequence file
> generated by RandomWriter, and the result is as follows:
>
> [hdpadmin@shihc008 rand-10mb]$ od -c part-00000 | head -n 15
> 0000000   S   E   Q 006   "   o   r   g   .   a   p   a   c   h   e   .
> 0000020   h   a   d   o   o   p   .   i   o   .   B   y   t   e   s   W
> 0000040   r   i   t   a   b   l   e   "   o   r   g   .   a   p   a   c
> 0000060   h   e   .   h   a   d   o   o   p   .   i   o   .   B   y   t
> 0000100   e   s   W   r   i   t   a   b   l   e  *\0  \0*  \0  \0  \0  \0
> 0000120 244   n   ! 177   L 316 030   q   g 035 351   L   ; 024 216 031
> 0000140  \0  \0  \t 234  \0  \0 001 305  \0  \0 001 301 207   v   5 255
> 0000160 220   ] 236   <  \b 367   &   9 241  \b   v 303   m 314 203 220
> 0000200 335  \0 241 325 232 035 037 267 303 360  \n 025   u   P 003 220
> 0000220   ^ 235 247 036   S 265 271 035   S 247   O   5 337   + 020   q
> 0000240 277   - 003 212   . 230 221   G 241   5   K   K 031 273 036 206
> 0000260   ( 317 303 367 351 214 364 262 340   S 211 230  \r 362   % 335
> 0000300   }   H   w   & 234   S   F 324 321 274   F 377   [ 344   [   h
> 0000320 204 001 265   ] 037   _   r   , 020 370 246 327 231 017 205 252
> 0000340 273 016 310   w 361 326 032 332 200   Y  \a   X 342  \r 016 364
>
> I found that the two marked bytes are set to zero, which means compression
> is turned off. Since both bytes are '\0', I suspect this may be a defect:
> these two bytes are never set, so the sequence file generated by
> RandomWriter cannot be compressed. I don't know whether this appears in
> other places as well.
>
> Is my understanding correct? If not, does anybody know why compression
> isn't working? Looking forward to your reply!
>
> Thanks and Best Regards,
> Carp
>
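The two flag bytes Carp points at can be read programmatically. The sketch
below is plain Java with no Hadoop dependency; it assumes the version-6 header
layout described on the wiki (magic "SEQ", version byte, length-prefixed key
and value class names, then the two boolean flag bytes), and assumes the class
names are short enough that each length prefix fits in a single byte, as the
'"' (0x22 = 34) before "org.apache.hadoop.io.BytesWritable" in the od dump
suggests:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SeqHeaderCheck {

    // Reads "SEQ", the version byte, skips the key and value class names,
    // then returns the two flag bytes as { compressed, blockCompressed }.
    static boolean[] compressionFlags(byte[] header) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(header));
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (!"SEQ".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IOException("not a SequenceFile");
        }
        in.readByte();     // version: the 006 right after "SEQ" in the dump
        skipClassName(in); // key class, e.g. o.a.h.io.BytesWritable
        skipClassName(in); // value class
        boolean compressed = in.readBoolean();      // first marked byte
        boolean blockCompressed = in.readBoolean(); // second marked byte
        return new boolean[] { compressed, blockCompressed };
    }

    // Assumes a single-byte length prefix (true for class names under
    // 128 characters under the vint encoding).
    private static void skipClassName(DataInputStream in) throws IOException {
        int len = in.readByte() & 0xff;
        in.skipBytes(len);
    }

    public static void main(String[] args) throws IOException {
        // Build a synthetic header matching the dump: both flag bytes zero.
        String cls = "org.apache.hadoop.io.BytesWritable";
        byte[] name = cls.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('S'); out.write('E'); out.write('Q'); out.write(6);
        out.write(name.length); out.write(name, 0, name.length); // key class
        out.write(name.length); out.write(name, 0, name.length); // value class
        out.write(0); out.write(0); // the two compression flags, both off
        boolean[] flags = compressionFlags(out.toByteArray());
        // prints compressed=false blockCompressed=false
        System.out.println("compressed=" + flags[0]
                + " blockCompressed=" + flags[1]);
    }
}
```

To inspect a real part-00000, read its first few hundred bytes into the array
instead of the synthetic header; if both flags come back false, the file was
written uncompressed regardless of what the job configuration asked for.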



-- 
Todd Lipcon
Software Engineer, Cloudera
