hadoop-user mailing list archives

From Young-Geun Park <younggeun.p...@gmail.com>
Subject Re: Lzo vs SequenceFile for big file
Date Mon, 10 Sep 2012 02:29:53 GMT
Has anyone tested the performance of the SequenceFile format versus LZO?

Regards,
Park

2012/9/7 Young-Geun PARK <younggeun.park@gmail.com>

> Ruslan,
> Thanks for your reply.
>
> Jobs' statistics are as follows;
>
> case 1 : uncompressed data(none)
> 12/08/09 16:12:44 INFO mapred.JobClient: Job complete:
> job_201208021633_0049
> 12/08/09 16:12:44 INFO mapred.JobClient: Counters: 23
> 12/08/09 16:12:44 INFO mapred.JobClient:   Job Counters
> 12/08/09 16:12:44 INFO mapred.JobClient:     Launched reduce tasks=1
> 12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3623053
> 12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 12/08/09 16:12:44 INFO mapred.JobClient:     Rack-local map tasks=1
> 12/08/09 16:12:44 INFO mapred.JobClient:     Launched map tasks=166
> 12/08/09 16:12:44 INFO mapred.JobClient:     Data-local map tasks=165
> 12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=220786
> 12/08/09 16:12:44 INFO mapred.JobClient:   FileSystemCounters
> 12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_READ=1852424288
> 12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_READ=10644581454
> 12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1894096220
> 12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
> 12/08/09 16:12:44 INFO mapred.JobClient:   Map-Reduce Framework
> 12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input groups=13661
> 12/08/09 16:12:44 INFO mapred.JobClient:     Combine output
> records=69055428
> 12/08/09 16:12:44 INFO mapred.JobClient:     Map input records=158156100
> 12/08/09 16:12:44 INFO mapred.JobClient:     Reduce shuffle bytes=33143186
> 12/08/09 16:12:44 INFO mapred.JobClient:     Reduce output records=13661
> 12/08/09 16:12:44 INFO mapred.JobClient:     Spilled Records=122916251
> 12/08/09 16:12:44 INFO mapred.JobClient:     Map output bytes=15704921900
> 12/08/09 16:12:44 INFO mapred.JobClient:     Combine input
> records=1332132129
> 12/08/09 16:12:44 INFO mapred.JobClient:     Map output records=1265248800
> 12/08/09 16:12:44 INFO mapred.JobClient:     SPLIT_RAW_BYTES=19716
> 12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input records=2172099
>
> case 2 : lzo
> 12/08/09 15:58:11 INFO mapred.JobClient: Job complete:
> job_201208021633_0048
> 12/08/09 15:58:11 INFO mapred.JobClient: Counters: 23
> 12/08/09 15:58:11 INFO mapred.JobClient:   Job Counters
> 12/08/09 15:58:11 INFO mapred.JobClient:     Launched reduce tasks=1
> 12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3361287
> 12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 12/08/09 15:58:11 INFO mapred.JobClient:     Rack-local map tasks=4
> 12/08/09 15:58:11 INFO mapred.JobClient:     Launched map tasks=65
> 12/08/09 15:58:11 INFO mapred.JobClient:     Data-local map tasks=61
> 12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=183529
> 12/08/09 15:58:11 INFO mapred.JobClient:   FileSystemCounters
> 12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_READ=568178351
> 12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_READ=3860287251
> 12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=576095398
> 12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
> 12/08/09 15:58:11 INFO mapred.JobClient:   Map-Reduce Framework
> 12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input groups=13661
> 12/08/09 15:58:11 INFO mapred.JobClient:     Combine output
> records=66734193
> 12/08/09 15:58:11 INFO mapred.JobClient:     Map input records=158156100
> 12/08/09 15:58:11 INFO mapred.JobClient:     Reduce shuffle bytes=4752406
> 12/08/09 15:58:11 INFO mapred.JobClient:     Reduce output records=13661
> 12/08/09 15:58:11 INFO mapred.JobClient:     Spilled Records=132612729
> 12/08/09 15:58:11 INFO mapred.JobClient:     Map output bytes=15704921900
> 12/08/09 15:58:11 INFO mapred.JobClient:     Combine input
> records=1331190655
> 12/08/09 15:58:11 INFO mapred.JobClient:     Map output records=1265248800
> 12/08/09 15:58:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=7366
> 12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input records=792338
>
> case 3 : sequence file compressed at block level with Snappy
> 12/09/05 18:33:00 INFO mapred.JobClient: Job complete:
> job_201209051652_0008
> 12/09/05 18:33:00 INFO mapred.JobClient: Counters: 23
> 12/09/05 18:33:00 INFO mapred.JobClient:   Job Counters
> 12/09/05 18:33:00 INFO mapred.JobClient:     Launched reduce tasks=1
> 12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5885897
> 12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 12/09/05 18:33:00 INFO mapred.JobClient:     Rack-local map tasks=2
> 12/09/05 18:33:00 INFO mapred.JobClient:     Launched map tasks=68
> 12/09/05 18:33:00 INFO mapred.JobClient:     Data-local map tasks=66
> 12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=1320075
> 12/09/05 18:33:00 INFO mapred.JobClient:   FileSystemCounters
> 12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_READ=3706936196
> 12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_READ=4419150507
> 12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4581439981
> 12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
> 12/09/05 18:33:00 INFO mapred.JobClient:   Map-Reduce Framework
> 12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input groups=13661
> 12/09/05 18:33:00 INFO mapred.JobClient:     Combine output records=0
> 12/09/05 18:33:00 INFO mapred.JobClient:     Map input records=158156100
> 12/09/05 18:33:00 INFO mapred.JobClient:     Reduce shuffle bytes=857964933
> 12/09/05 18:33:00 INFO mapred.JobClient:     Reduce output records=13661
> 12/09/05 18:33:00 INFO mapred.JobClient:     Spilled Records=6232725043
> 12/09/05 18:33:00 INFO mapred.JobClient:     Map output bytes=15704921900
> 12/09/05 18:33:00 INFO mapred.JobClient:     Combine input records=0
> 12/09/05 18:33:00 INFO mapred.JobClient:     Map output records=1265248800
> 12/09/05 18:33:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=8382
> 12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input
> records=1265248800
> Regards,
> Park
>
> 2012/9/7 Ruslan Al-Fakikh <ruslan.al-fakikh@jalent.ru>
>
>> Hi,
>>
>> It would be interesting to see the jobs' statistics (counters).
>>
>> Thanks
>>
>> On Fri, Sep 7, 2012 at 3:25 AM, Young-Geun Park
>> <younggeun.park@gmail.com> wrote:
>> > Hi, All
>> >
>> > I have tested which of LZO and SequenceFile works better for a BIG
>> > file.
>> >
>> > The file size is 10 GiB and a WordCount MR job is used.
>> > The inputs to the WordCount job are an LZO file indexed with
>> > LzoIndexTool (lzo), a sequence file block-compressed with Snappy
>> > (seq), and the uncompressed original file (none).
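A block-compressed Snappy SequenceFile like the one described above is typically produced by setting the output compression properties on the job that writes it. The following is a sketch using the old `mapred` API property names that CDH3 uses; the original poster's exact configuration is not shown in the thread:

```xml
<!-- Job configuration sketch (assumed, not the poster's actual settings):
     write output as a SequenceFile, block-compressed with Snappy -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```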
>> >
>> > Map output is compressed in every case except the uncompressed one.
>> > The MapReduce output is not compressed in any case.
>> >
>> > The following are the WordCount MR running times:
>> > none       lzo         seq
>> > 248s      243s     1410s
>> >
>> > -Test Environments
>> >
>> > OS : CentOS 5.6 (x64) (kernel = 2.6.18)
>> > # of Core  : 8 (cpu = Intel(R) Xeon(R) CPU E5504  @ 2.00GHz)
>> > RAM : 18GB
>> > Java version : 1.6.0_26
>> > Hadoop version : CDH3U2
>> > # of datanode(tasktracker) :  8
>> >
>> > According to the results, the running time of the SequenceFile case
>> > is much greater than that of the others.
>> > Before testing, I had expected the results for SequenceFile and LZO
>> > to be about the same.
>> >
>> > Why is the performance of the sequence file compressed with Snappy
>> > so bad?
>> >
>> > Am I missing anything in these tests?
>> >
>> >
>> > Regards,
>> > Park
>> >
>> >
>>
>>
>>
>> --
>> Best Regards,
>> Ruslan Al-Fakikh
>>
>
>
