hadoop-common-user mailing list archives

From "Andy Sautins" <andy.saut...@returnpath.net>
Subject Strange behavior with bzip2 input files w/release 0.19.0
Date Thu, 04 Dec 2008 17:11:57 GMT

    I'm seeing some strange behavior with bzip2 input files in release
0.19.0, and I'm wondering if anyone can shed some light on it.
Basically it _looks_ like processing of a particular bzip2 input file
stops after the first bzip2 block.  Below is a comparison of tests
between a .gz file, which behaves as I expect, and the same file
compressed as .bz2, which does not.


    I have the same file stored in hadoop, compressed in both bzip2 and
gzip formats.  The uncompressed file size is 660,841,894 bytes.
Comparing the two, both seem to be valid archives of exactly the same
file.


/usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum

2c82901170f44245fb04d24ad4746e38  -


/usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz | gunzip -c | md5sum

2c82901170f44245fb04d24ad4746e38  -


    Given that the md5 sums match, the two files appear to have
identical contents and to decompress correctly.
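
For anyone who wants to repeat the checksum comparison without the
md5sum pipelines above, here is a minimal Python sketch.  It assumes the
two files have been copied out of hadoop to local disk; the function
name and the local file names in the usage note are illustrative, not
part of the original test.

```python
import bz2
import gzip
import hashlib


def md5_of_decompressed(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the decompressed contents of a .gz or .bz2 file."""
    # Pick the decompressor from the file extension (assumption: only these two formats).
    opener = bz2.open if path.endswith(".bz2") else gzip.open
    digest = hashlib.md5()
    with opener(path, "rb") as stream:
        # Read in chunks so a file of several hundred megabytes never sits in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_decompressed("file.txt.gz") and
md5_of_decompressed("file.txt.bz2") on local copies should return the
same digest if the archives hold the same data.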


    However, when I run a simple Map/Reduce application that just
counts the lines in the file, I get different results for the two
inputs.


  Expected Results:


 /usr/local/hadoop/bin/hadoop dfs -cat bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l



   Gzip input file Results: 6,884,024

   Bzip2 input file Results: 9,420
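
As a cross-check that does not involve Map/Reduce at all, the line
count of either archive can be computed locally.  The sketch below
assumes local copies of the files and counts newline characters, which
is what wc -l reports; the function name is illustrative.

```python
import bz2
import gzip


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in the decompressed contents of a .gz or .bz2 file."""
    # Pick the decompressor from the file extension (assumption: only these two formats).
    opener = bz2.open if path.endswith(".bz2") else gzip.open
    total = 0
    with opener(path, "rb") as stream:
        # Stream the data in chunks; the whole file is read, not just its first block.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total
```

If both archives hold the same data, count_newlines should return the
same number for the .gz and the .bz2 copy.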



   Looking at the task log files, MAP_INPUT_BYTES for the .gz file
looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)]) and
matches the size of the uncompressed file.  For the .bz2 file, however,
MAP_INPUT_BYTES is 900,000 ([(MAP_INPUT_BYTES)(Map input bytes)(900000)]),
which matches the block size of the bzip2 compressed file.  That makes
me think that, for some reason, only the first bzip2 block of the
compressed file is being processed.
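
The numbers above can be sanity-checked with a little arithmetic.  A
bzip2 stream starts with the bytes "BZh" followed by a digit from 1 to
9 giving the nominal block size in units of 100,000 bytes, so a file
written at level 9 declares 900,000-byte blocks.  The sketch below reads
that digit and also estimates how many lines a prefix of the
uncompressed data would hold, assuming lines are spread evenly through
the file (an assumption, not something measured here).  Both function
names are illustrative.

```python
def bzip2_block_size(header):
    """Return the nominal uncompressed block size, in bytes, declared by a bzip2 header."""
    level = header[3:4]
    # A bzip2 stream begins with b"BZh" and one level digit, b"1" through b"9".
    if header[:3] != b"BZh" or len(level) != 1 or not b"1" <= level <= b"9":
        raise ValueError("not a bzip2 stream header")
    return int(level) * 100_000


def estimated_lines(prefix_bytes, total_bytes, total_lines):
    """Estimate the lines in the first prefix_bytes, assuming uniform line length."""
    return prefix_bytes * total_lines / total_bytes
```

With the figures from this message,
estimated_lines(900_000, 660_841_894, 6_884_024) comes to roughly
9,375 lines, which is in the same range as the 9,420 lines counted for
the bzip2 input.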


    So I'm wondering whether my analysis is correct and whether there
could be an issue with how bzip2 input files are processed.


