hadoop-hdfs-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Using Hadoop for codec functionality
Date Sun, 31 Mar 2013 09:38:03 GMT
Your question could be interpreted another way: should I use Hadoop to
perform massive compression/decompression using my own (possibly
proprietary) utility?

So yes, Hadoop can be used to parallelize the work. But the real answer
will depend on your context, as always.
How many files need to be processed? What is their average size? Is your
utility parallelizable? How will the data be used afterwards?

The number of files and their size are important because Hadoop is designed
to deal with a relatively small number of relatively big files: a few
million gigabyte-sized files rather than billions of megabyte-sized
files. Many small files can become a performance issue. But a
huge file is not necessarily better: if your utility is not
parallelizable then, regardless of Hadoop, uncompressing a 2GB file requires
a single process to read the whole file, and the uncompressed version
then needs to be stored somewhere.

So the final question is: for what purpose? If it is for massive
decompression, keeping the compressed version inside Hadoop seems a sane
strategy. So it might be better to rely on a standard compression utility
and uncompress only just before processing, inside Hadoop itself. If it is
for compression, well, it might not be that massive, because you might not
receive that many files at the same time.

The common strategy in Hadoop is not to compress a whole file but instead
to compress the parts (blocks) of the file. This way the size of each
compression task is bounded and the work can be parallelized even
with a non-parallelizable compression utility. The drawback is that the
"list of compressed blocks" is not a standard compressed file, so
interoperability with other parts of your system is not guaranteed without
extra work.
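
To make the block idea concrete, here is a minimal sketch in plain Python.
It is not Hadoop code and not the SequenceFile format: the block size, the
length-prefixed container layout and the function names are all assumptions
made up for illustration, and zlib stands in for whatever utility you
actually use. It only shows that each block can be compressed on its own,
and therefore in parallel, even though compressing one block is sequential.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size, chosen only for illustration.
BLOCK_SIZE = 64 * 1024


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress every block independently and pack them into one container.

    Assumed layout: for each block, a 4-byte big-endian length followed by
    the zlib-compressed bytes of that block.
    """
    blocks = split_blocks(data, block_size)
    # Blocks do not depend on each other, so they can be handed to workers.
    with ThreadPoolExecutor() as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    out = bytearray()
    for chunk in compressed:
        out += struct.pack(">I", len(chunk))
        out += chunk
    return bytes(out)


def decompress_blocks(container):
    """Read the length-prefixed blocks back and decompress them in order."""
    chunks = []
    offset = 0
    while offset < len(container):
        (length,) = struct.unpack_from(">I", container, offset)
        offset += 4
        chunks.append(container[offset:offset + length])
        offset += length
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, chunks))
```

Calling decompress_blocks(compress_blocks(data)) gives back the original
bytes, but the container itself is not something a standard tool can open,
which is exactly the interoperability cost mentioned above.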


On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
jens.scheidtmann@gmail.com> wrote:

> Dear Robert,
> SequenceFiles support record compression, block compression, or no
> compression. You can configure which codec (gzip, bzip2, etc.) is used.
> Have a look at
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
> Best regards,
> Jens

Bertrand Dechoux
