hadoop-user mailing list archives

From Young-Geun Park <younggeun.p...@gmail.com>
Subject Lzo vs SequenceFile for big file
Date Thu, 06 Sep 2012 23:25:25 GMT
Hi, All

I have tested which of LZO and SequenceFile works better for a BIG
file.

The file is 10 GiB and the WordCount MR job is used. The inputs to
WordCount are: an LZO file indexed with LzoIndexTool (lzo), a
SequenceFile compressed with block-level Snappy (seq), and the
uncompressed original file (none).
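For reference, the LZO input was indexed roughly as follows (the jar path and file path below are placeholders for this environment; the indexer class comes from the hadoop-lzo package):

```shell
# Index the .lzo file so MapReduce can split it across mappers.
# The jar location and input path are placeholders.
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
    com.hadoop.compression.lzo.LzoIndexer /data/wordcount/big.lzo
```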

Map output is compressed in every case except the uncompressed one; the
job output is not compressed in any case.
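The map-output compression described above was enabled with job properties roughly like these (old-style Hadoop 0.20 / CDH3 property names; the jar and input/output paths are placeholders, not the exact commands used):

```shell
# Run WordCount with Snappy-compressed map output and
# uncompressed job output; paths are placeholders.
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
    -Dmapred.compress.map.output=true \
    -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -Dmapred.output.compress=false \
    /data/wordcount/input /data/wordcount/output
```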

The WordCount MR running times are:
none       lzo         seq
248s      243s     1410s

-Test Environments

   - OS : CentOS 5.6 (x64) (kernel = 2.6.18)
   - # of Core  : 8 (cpu = Intel(R) Xeon(R) CPU E5504  @ 2.00GHz)
   - RAM : 18GB
   - Java version : 1.6.0_26
   - Hadoop version : CDH3U2
   - # of datanode(tasktracker) :  8

According to these results, the running time for the SequenceFile is
much greater than for the others. Before testing, I had expected the
SequenceFile and LZO results to be about the same.

I want to know why the performance of the Snappy-compressed SequenceFile
is so bad.

Am I missing anything in these tests?


Regards,
Park
