hadoop-common-user mailing list archives

From Young-Geun Park <younggeun.p...@gmail.com>
Subject Lzo vs SequenceFile for big file
Date Thu, 06 Sep 2012 23:25:25 GMT
Hi, All

I have tested which method performs better, LZO or SequenceFile, for a BIG file.

The file size is 10 GiB and a WordCount MR job is used. The inputs to the
WordCount job are: an LZO-compressed file that has been indexed so it is
splittable (lzo), a SequenceFile compressed with block-level Snappy (seq),
and the uncompressed original file (none).
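For reference, the lzo and none inputs could be prepared roughly as below. This is a sketch, not the poster's actual commands: the paths and the hadoop-lzo jar location are placeholders, though the DistributedLzoIndexer class does come from the hadoop-lzo project used with CDH3.

```shell
# Compress the raw input with lzop and upload it (paths are hypothetical).
lzop -o big.txt.lzo big.txt
hadoop fs -put big.txt.lzo /input/lzo/

# Build the split index so the .lzo file becomes splittable;
# DistributedLzoIndexer ships with the hadoop-lzo project.
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /input/lzo/big.txt.lzo

# The uncompressed baseline (none) is just the raw file.
hadoop fs -put big.txt /input/none/
```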

Map output is compressed in every case except the uncompressed-input run. The
MapReduce job output is not compressed in any case.
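A minimal sketch of that compression setup, using the old-style property names from Hadoop 0.20/CDH3; the Snappy codec for map output is an assumption, since the original post does not say which codec was used:

```xml
<!-- mapred-site.xml (CDH3 / Hadoop 0.20 property names) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <!-- assumed true; set to false for the uncompressed-input run -->
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <!-- codec choice is an assumption, not stated in the post -->
</property>
<property>
  <name>mapred.output.compress</name>
  <value>false</value>
  <!-- job output left uncompressed in all runs -->
</property>
```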

The following are the WordCount MR running times:

   input:   none    lzo     seq
   time:    248s    243s    1410s

-Test Environments

   - OS : CentOS 5.6 (x64) (kernel = 2.6.18)
   - # of Core  : 8 (cpu = Intel(R) Xeon(R) CPU E5504  @ 2.00GHz)
   - RAM : 18GB
   - Java version : 1.6.0_26
   - Hadoop version : CDH3U2
   - # of datanode(tasktracker) :  8

According to the results, the running time with the SequenceFile input is much
greater than the others. Before testing, I had expected that SequenceFile and
LZO would perform about the same.

Why is the performance of the SequenceFile compressed with Snappy so bad?

Am I missing anything in my tests?

