hadoop-common-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Bzip2 files as an input to MR job
Date Tue, 23 Sep 2014 13:39:02 GMT
Georgi:
I think you misunderstood the original answer.
If you already use the Avro format, then the files will be splittable. If you want to add compression
on top of that, feel free to go ahead.
If you read the Avro DataFileWriter API:
http://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
You will see there is a setCodec method, which allows you to specify the codec used to compress
your data.
The compression can be either per block, or per record. Per block is recommended, as it will
be more efficient.
You can use bzip2 or gzip or snappy or any other compression. You just need to use the
above API, and make sure the compression codec is available on all your task nodes.
Whether the compression is splittable or unsplittable doesn't matter to you in this case, as you are using
Avro, which is splittable.
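As a rough illustration of the setCodec call described above, here is a minimal sketch, assuming the Avro Java library (1.7.x) is on the classpath. The class name AvroCodecSketch, the one-field "Line" schema, and the helper methods are made up for this example; only DataFileWriter, DataFileReader, CodecFactory and the generic record classes come from Avro. The codec has to be set before create() is called, and the reader picks the codec up from the file header, so the reading side needs no codec-specific code.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

class AvroCodecSketch {
    // Hypothetical one-field schema, used only for this illustration.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Line\","
        + "\"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");

    // Write the given lines into an Avro data file compressed with the given codec.
    static void write(File file, CodecFactory codec, List<String> lines) {
        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            writer.setCodec(codec);        // must happen before create()
            writer.create(SCHEMA, file);
            for (String line : lines) {
                GenericRecord record = new GenericData.Record(SCHEMA);
                record.put("text", line);
                writer.append(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Return the codec name stored in the file header metadata.
    static String codecOf(File file) {
        try (DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                file, new GenericDatumReader<GenericRecord>())) {
            return reader.getMetaString("avro.codec");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read every record back; no codec is specified on the reading side.
    static List<String> readAll(File file) {
        List<String> out = new ArrayList<String>();
        try (DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                out.add(record.get("text").toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("codec-sketch", ".avro");
        file.deleteOnExit();
        // deflate needs no extra libraries; other codecs are passed the same way.
        write(file, CodecFactory.deflateCodec(6), Arrays.asList("first", "second"));
        System.out.println("codec: " + codecOf(file));
        System.out.println("records: " + readAll(file));
    }
}
```

CodecFactory.snappyCodec() or CodecFactory.fromString("bzip2") would be passed to write() in the same way, provided your Avro version and classpath support that codec.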
What you need to decide is which compression codec best fits your application's use case.
In our production, we use snappy, as it gives us a good balance between compression ratio
and read/decompression speed and CPU usage.
Different codecs have different trade-offs. You need to compare them based on your case.
Yong

Date: Mon, 22 Sep 2014 17:21:29 +0200
From: ivanov@vesseltracker.com
To: user@hadoop.apache.org
Subject: Re: Bzip2 files as an input to MR job

    Hi Niels,

    Thanks for the reply.
    Changing the Avro files is not really an option for me, as it would
    require a lot of time (I have a lot of them).
    The Avro files themselves are already compressed a bit,
    but bzip2 still gives 50% compression on one Avro file.

    So what I want is to use bzip2-compressed files as input to my
    MR jobs.
    Bzip2 is splittable.
    It should be possible somehow, but I can't seem to find how at the moment.

    On 22.09.2014 17:13, Niels Basjes wrote:

        Hi,

        You can use GZip inside the Avro files and still have
        splittable Avro files.
        This has to do with the fact that there is a block
        structure inside the Avro file, and these blocks are gzipped.
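
To illustrate why per-block compression keeps a file splittable, here is a small sketch using only java.util.zip. It is not Avro's actual file format (there are no headers or sync markers here), and the class and method names are invented for the example. It also assumes records contain no newline characters. Each block is deflated independently, so a reader can decompress block 1 without touching block 0, which is the property a split-aware reader relies on.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class BlockCompressionSketch {

    // Deflate one block on its own, with no state shared with other blocks.
    static byte[] compressBlock(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate one block; only that block's bytes are needed.
    static byte[] decompressBlock(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or corrupt block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Group records into blocks of recordsPerBlock and compress each block separately.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<byte[]>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(i, end));
            blocks.add(compressBlock(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    // Read the records of a single block without decompressing any other block.
    static List<String> readBlock(List<byte[]> blocks, int index) {
        byte[] raw = decompressBlock(blocks.get(index));
        return Arrays.asList(new String(raw, StandardCharsets.UTF_8).split("\n"));
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r0", "r1", "r2", "r3", "r4");
        List<byte[]> blocks = writeBlocks(records, 2);
        // A task assigned the middle of the file could start at block 1 directly.
        System.out.println(readBlock(blocks, 1));
    }
}
```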

        I suggest you simply try it.

        Niels

        On Mon, Sep 22, 2014 at 4:40 PM,
        Georgi Ivanov <ivanov@vesseltracker.com> wrote:

            Hi guys,

            I would like to compress the files on HDFS to save some
            storage.
            As far as I can see, bzip2 is the only format which is
            splittable (and slow).
            The actual files are Avro.
            So in my driver class I have:

            job.setInputFormatClass(AvroKeyInputFormat.class);

            I have a number of jobs running that process Avro files, so I
            would like to keep the code changes to a minimum.
            Is it possible to compress these Avro files with bzip2 and
            keep the code of the MR jobs the same (or with little change)?
            If it is, please give me some hints, as so far I can't
            seem to find any good resources on the Internet.

            Georgi

        -- 
        Best regards / Met vriendelijke groeten,

        Niels Basjes