hadoop-user mailing list archives

From 박영근(Alex) <alex.p...@nexr.com>
Subject Re: Lzo vs SequenceFile for big file
Date Fri, 07 Sep 2012 09:06:29 GMT
Ruslan,
Thanks for your reply.

The jobs' statistics are as follows:

Case 1: uncompressed data (none)
12/08/09 16:12:44 INFO mapred.JobClient: Job complete: job_201208021633_0049
12/08/09 16:12:44 INFO mapred.JobClient: Counters: 23
12/08/09 16:12:44 INFO mapred.JobClient:   Job Counters
12/08/09 16:12:44 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3623053
12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient:     Rack-local map tasks=1
12/08/09 16:12:44 INFO mapred.JobClient:     Launched map tasks=166
12/08/09 16:12:44 INFO mapred.JobClient:     Data-local map tasks=165
12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=220786
12/08/09 16:12:44 INFO mapred.JobClient:   FileSystemCounters
12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_READ=1852424288
12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_READ=10644581454
12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1894096220
12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/08/09 16:12:44 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input groups=13661
12/08/09 16:12:44 INFO mapred.JobClient:     Combine output records=69055428
12/08/09 16:12:44 INFO mapred.JobClient:     Map input records=158156100
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce shuffle bytes=33143186
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce output records=13661
12/08/09 16:12:44 INFO mapred.JobClient:     Spilled Records=122916251
12/08/09 16:12:44 INFO mapred.JobClient:     Map output bytes=15704921900
12/08/09 16:12:44 INFO mapred.JobClient:     Combine input records=1332132129
12/08/09 16:12:44 INFO mapred.JobClient:     Map output records=1265248800
12/08/09 16:12:44 INFO mapred.JobClient:     SPLIT_RAW_BYTES=19716
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input records=2172099

Case 2: lzo
12/08/09 15:58:11 INFO mapred.JobClient: Job complete: job_201208021633_0048
12/08/09 15:58:11 INFO mapred.JobClient: Counters: 23
12/08/09 15:58:11 INFO mapred.JobClient:   Job Counters
12/08/09 15:58:11 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3361287
12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient:     Rack-local map tasks=4
12/08/09 15:58:11 INFO mapred.JobClient:     Launched map tasks=65
12/08/09 15:58:11 INFO mapred.JobClient:     Data-local map tasks=61
12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=183529
12/08/09 15:58:11 INFO mapred.JobClient:   FileSystemCounters
12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_READ=568178351
12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_READ=3860287251
12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=576095398
12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/08/09 15:58:11 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input groups=13661
12/08/09 15:58:11 INFO mapred.JobClient:     Combine output records=66734193
12/08/09 15:58:11 INFO mapred.JobClient:     Map input records=158156100
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce shuffle bytes=4752406
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce output records=13661
12/08/09 15:58:11 INFO mapred.JobClient:     Spilled Records=132612729
12/08/09 15:58:11 INFO mapred.JobClient:     Map output bytes=15704921900
12/08/09 15:58:11 INFO mapred.JobClient:     Combine input records=1331190655
12/08/09 15:58:11 INFO mapred.JobClient:     Map output records=1265248800
12/08/09 15:58:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=7366
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input records=792338

Case 3: sequence file with block-level Snappy compression (seq)

12/09/05 18:33:00 INFO mapred.JobClient: Job complete: job_201209051652_0008
12/09/05 18:33:00 INFO mapred.JobClient: Counters: 23
12/09/05 18:33:00 INFO mapred.JobClient:   Job Counters
12/09/05 18:33:00 INFO mapred.JobClient:     Launched reduce tasks=1
12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5885897
12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient:     Rack-local map tasks=2
12/09/05 18:33:00 INFO mapred.JobClient:     Launched map tasks=68
12/09/05 18:33:00 INFO mapred.JobClient:     Data-local map tasks=66
12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=1320075
12/09/05 18:33:00 INFO mapred.JobClient:   FileSystemCounters
12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_READ=3706936196
12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_READ=4419150507
12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4581439981
12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/09/05 18:33:00 INFO mapred.JobClient:   Map-Reduce Framework
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input groups=13661
12/09/05 18:33:00 INFO mapred.JobClient:     Combine output records=0
12/09/05 18:33:00 INFO mapred.JobClient:     Map input records=158156100
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce shuffle bytes=857964933
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce output records=13661
12/09/05 18:33:00 INFO mapred.JobClient:     Spilled Records=6232725043
12/09/05 18:33:00 INFO mapred.JobClient:     Map output bytes=15704921900
12/09/05 18:33:00 INFO mapred.JobClient:     Combine input records=0
12/09/05 18:33:00 INFO mapred.JobClient:     Map output records=1265248800
12/09/05 18:33:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=8382
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input records=1265248800
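
To make the three runs easier to compare, here is a small Python sketch that derives a few ratios from the counters above. It is not part of the jobs themselves; the dictionary keys and the summarize() helper are names chosen only for this illustration, and every number is copied from the logs. A value of 0 for "Combine input records" suggests, though the counters alone cannot prove it, that no combiner ran in the seq job.

```python
# Counter values copied from the three JobClient logs above.
# Key names are illustrative labels, not Hadoop identifiers.
counters = {
    "none": {
        "map_output_records": 1265248800,
        "combine_input_records": 1332132129,
        "reduce_input_records": 2172099,
        "spilled_records": 122916251,
    },
    "lzo": {
        "map_output_records": 1265248800,
        "combine_input_records": 1331190655,
        "reduce_input_records": 792338,
        "spilled_records": 132612729,
    },
    "seq": {
        "map_output_records": 1265248800,
        "combine_input_records": 0,
        "reduce_input_records": 1265248800,
        "spilled_records": 6232725043,
    },
}


def summarize(c):
    """Derive simple ratios from one job's counters (hypothetical helper)."""
    map_out = c["map_output_records"]
    return {
        # Zero combine input means the combiner processed nothing.
        "combiner_ran": c["combine_input_records"] > 0,
        # Fraction of map output records that reached the reducer.
        "reduce_in_per_map_out": c["reduce_input_records"] / map_out,
        # Spilled records per map output record.
        "spilled_per_map_out": c["spilled_records"] / map_out,
    }


summaries = {name: summarize(c) for name, c in counters.items()}

for name, s in summaries.items():
    print(
        name,
        "combiner_ran=" + str(s["combiner_ran"]),
        "reduce_in/map_out=%.6f" % s["reduce_in_per_map_out"],
        "spilled/map_out=%.3f" % s["spilled_per_map_out"],
    )
```

Read this way, the seq run is the only one where every map output record reaches the single reducer, and it spills roughly 4.9 records per map output record, against about 0.1 for the other two runs.
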
Regards,
Park

2012/9/7 Ruslan Al-Fakikh <ruslan.al-fakikh@jalent.ru>

> Hi,
>
> It would be interesting to see the jobs' statistics (counters).
>
> Thanks
>
> On Fri, Sep 7, 2012 at 3:25 AM, Young-Geun Park
> <younggeun.park@gmail.com> wrote:
> > Hi, All
> >
> > I have tested which performs better for a big file: LZO or SequenceFile.
> >
> > The file size is 10 GiB, and the job is WordCount MR.
> > The WordCount inputs are an LZO file indexed by LzoIndexTool (lzo),
> > a sequence file compressed with block-level Snappy (seq), and the
> > uncompressed original file (none).
> >
> > Map output is compressed in every case except the uncompressed one (none).
> > The final MapReduce output is not compressed in any case.
> >
> > The WordCount MR running times are as follows:
> > none      lzo       seq
> > 248s      243s      1410s
> >
> > - Test environment
> >
> > OS: CentOS 5.6 (x64) (kernel = 2.6.18)
> > # of cores: 8 (CPU = Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
> > RAM: 18GB
> > Java version: 1.6.0_26
> > Hadoop version: CDH3U2
> > # of datanodes (tasktrackers): 8
> >
> > According to the results, the running time with SequenceFile is much
> > longer than the others (1410s versus 248s and 243s).
> > Before testing, I had expected SequenceFile and LZO to perform about
> > the same.
> >
> > Why is the performance of the Snappy-compressed sequence file so bad?
> >
> > Am I missing anything in the tests?
> >
> >
> > Regards,
> > Park
> >
> >
>
>
>
> --
> Best Regards,
> Ruslan Al-Fakikh
>
