I ran a MapReduce (MR) program, WordCount:
The input file is a sequence file with Snappy block compression.
The InputFormat is SequenceFileInputFormat.
To check whether the SequenceFile.Writer.sync() method affects an MR program, I tested two cases.
In one case, writer.sync() was called; in the other case, it was not.
The result was that there was no difference in MR running time between the two cases:
the elapsed times were about the same.
Doesn't the sync() method in SequenceFile.Writer affect MR performance?
According to the source code, a sequence file is split in getSplits() of FileInputFormat,
which is the superclass of SequenceFileInputFormat.
With the default configuration, the split size in getSplits() is set to the default block size (dfs.block.size).
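To make this concrete, here is a minimal sketch of the split-size calculation as I understand it from reading FileInputFormat. It does not use Hadoop itself. The class name SplitSizeSketch, the 64 MB block size, and the 200 MB file length are my own assumptions for illustration; the formula and the 1.1 slop factor are what I believe getSplits() uses.

```java
public class SplitSizeSketch {
    // Formula as I read it in FileInputFormat.computeSplitSize():
    // the block size, clamped between the configured min and max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts splits purely by byte offsets, with no knowledge of records.
    // The 1.1 slop lets the last split be up to 10% larger than splitSize.
    static int countSplits(long fileLength, long splitSize) {
        final double splitSlop = 1.1;
        int splits = 0;
        long remaining = fileLength;
        while (((double) remaining) / splitSize > splitSlop) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining != 0) {
            splits++; // the tail of the file becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // assumed dfs.block.size
        long minSize = 1L;                    // assumed default minimum
        long maxSize = Long.MAX_VALUE;        // assumed default maximum
        long fileLength = 200L * 1024 * 1024; // hypothetical input file

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println("splitSize = " + splitSize);
        System.out.println("splits    = " + countSplits(fileLength, splitSize));
    }
}
```

Nothing in this calculation looks at the file contents, so the split offsets can fall in the middle of a record or a compressed block.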
But I think record boundaries should be considered when splitting a sequence file.
I cannot understand how a sequence file can be split by the default block size without considering record boundaries.
Am I missing something?