hadoop-common-user mailing list archives

From "Devaraj Das" <d...@yahoo-inc.com>
Subject RE: File Compression
Date Tue, 13 Nov 2007 17:23:57 GMT
Yes, io.seqfile.compression.type controls compression of the mapred
(SequenceFile) output only. One way to compress files on the DFS,
independent of mapred, is to layer the java.util.zip package over the
OutputStream that DistributedFileSystem.create() returns. For example,
pass the org.apache.hadoop.fs.FSDataOutputStream that
org.apache.hadoop.dfs.DistributedFileSystem.create() returns as an
argument to the java.util.zip.GZIPOutputStream constructor.
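A minimal sketch of that approach (the path and payload below are made
up for illustration; FileSystem.get() resolves to the
DistributedFileSystem when fs.default.name points at a DFS namenode):

import java.util.zip.GZIPOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipDfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // DistributedFileSystem on a DFS cluster

    // create() hands back an FSDataOutputStream; wrapping it means every
    // byte is gzip-compressed on the client before it reaches the DFS.
    FSDataOutputStream raw = fs.create(new Path("/logs/app.log.gz"));
    GZIPOutputStream out = new GZIPOutputStream(raw);
    try {
      out.write("some log data\n".getBytes());
    } finally {
      out.close(); // writes the gzip trailer and closes the DFS stream
    }
  }
}

To read the file back, wrap the FSDataInputStream from FileSystem.open()
in a java.util.zip.GZIPInputStream the same way. One caveat: gzipped
files are not splittable, so a map/reduce job over such a file runs a
single map per file.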

> -----Original Message-----
> From: Michael Harris [mailto:MichaelH@Telespree.com] 
> Sent: Tuesday, November 13, 2007 10:27 PM
> To: hadoop-user@lucene.apache.org
> Subject: File Compression
> 
> I have a question about file compression in Hadoop. When I 
> set io.seqfile.compression.type=BLOCK, does this also 
> compress the actual files I load into the DFS, or does it only 
> control the map/reduce file compression? If it doesn't 
> compress the files on the file system, is there any way to 
> compress a file when it's loaded? The concern here is that I 
> am just getting started with Pig/Hadoop and have a very small 
> cluster of around 5 nodes. I want to limit I/O wait by 
> compressing the actual data. As a test, when I compressed our 
> 4 GB log file using RAR it was only 280 MB.
> 
> Thanks,
> Michael
> 
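For reference, the setting from the question above is placed in
hadoop-site.xml; a minimal sketch (the recognized values are NONE,
RECORD, and BLOCK, and it applies only to SequenceFiles, not to plain
files copied into the DFS):

<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
</property>

BLOCK compresses batches of SequenceFile records together, which
generally compresses better than per-record compression.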

