hadoop-user mailing list archives

From Young-Geun Park <younggeun.p...@gmail.com>
Subject questions about SequenceFile
Date Tue, 04 Sep 2012 08:30:18 GMT
Hi all,


I ran an MR program, WordCount.

The input file is a sequence file with Snappy block compression.

The InputFormat is SequenceFileInputFormat.


To check whether the SequenceFile.Writer.sync() method affects an MR
program, I wrote the input file in two ways.

In one case, writer.sync() was called while writing the file; in the other
case, sync() was not called.


The result was that there was no difference in MR running time between the
two cases; the elapsed times were about the same.


Does the sync() method in SequenceFile.Writer not affect MR performance?


Another question:

According to the sources, a sequence file is split in getSplits() of
FileInputFormat, which is the superclass of SequenceFileInputFormat.

With the default configuration, the split size in getSplits() is set to the
default block size (dfs.block.size).
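
As far as I can tell from the source, the split size comes down to
max(minSize, min(maxSize, blockSize)). Below is a minimal standalone sketch
of that arithmetic, not the Hadoop code itself. The class name
SplitSizeSketch, the helper splitOffsets(), the 64 MB block size, and the
200 MB file length are assumptions for illustration only, and the real
getSplits() also allows some slack on the last split, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSizeSketch {

    // Arithmetic as I read it in FileInputFormat: clamp the block size
    // between the minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper (not a Hadoop method): start offsets of
    // fixed-size splits over a file of the given length.
    static List<Long> splitOffsets(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Long> offsets = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            offsets.add(start);
        }
        return offsets;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed dfs.block.size (64 MB)
        long minSize = 1L;                  // assumed default minimum split size
        long maxSize = Long.MAX_VALUE;      // assumed default maximum split size
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        System.out.println("splitSize = " + splitSize);
        // Assumed file length of 200 MB, for illustration only.
        System.out.println("offsets = " + splitOffsets(200L * 1024 * 1024, splitSize));
    }
}
```

With these assumed defaults the split size equals the block size, and the
split start offsets are plain byte positions chosen without looking at the
file contents.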

But I think that record boundaries should be considered when splitting a
sequence file.

I cannot understand how a sequence file can be split by the default block
size without considering record boundaries.

Am I missing something?


Regards,

Park
