hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: questions about SequenceFile
Date Wed, 05 Sep 2012 01:11:44 GMT
Hi Young,

Note that the SequenceFile.Writer#sync method is not the same as HDFS
sync(). It just writes a sync marker: a set of bytes marking an end
point for one or more records, somewhat like a newline in a text file,
except that it is not written after every record.

I don't think calling sync() would affect MR running time much.
However, if you want larger compressed blocks, you should sync less
often (i.e. write more data between sync marker points).
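
To illustrate the point about sync frequency, here is a minimal Python
sketch. It is a toy model, not Hadoop's on-disk format or API: the
marker bytes, the 4-byte length prefix, and the write_records helper
are all assumptions made up for the example. It only shows that
syncing less often leaves fewer markers and more data between them.

```python
# Hypothetical 16-byte marker, chosen only for this illustration.
SYNC = b"\xde\xad\xbe\xef" * 4


def write_records(records, sync_every):
    """Concatenate length-prefixed records, writing a marker before
    every `sync_every`-th record (but not before the first one)."""
    out = bytearray()
    for i, rec in enumerate(records):
        if i > 0 and i % sync_every == 0:
            out += SYNC
        out += len(rec).to_bytes(4, "big") + rec
    return bytes(out)


records = [f"record-{i}".encode() for i in range(12)]

frequent = write_records(records, sync_every=2)  # marker every 2 records
sparse = write_records(records, sync_every=6)    # marker every 6 records

frequent_markers = frequent.count(SYNC)
sparse_markers = sparse.count(SYNC)
print(frequent_markers, sparse_markers)
```

The record payload is identical in both outputs; only the number of
markers, and therefore the amount of data between two markers, changes.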

The SequenceFile Reader takes care of the record boundary checks when
given an offset and a length to read. The reader automatically adjusts
the read to the next sync point. The logic of record boundary reading
in MR split-read mode is hence similar to the newline file reading
explained under http://wiki.apache.org/hadoop/HadoopMapReduce, except
think of the sync markers as the newlines here.
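
The split-read logic can be sketched with a small Python toy model.
This is not Hadoop's code: the SYNC bytes, the "|" record separator,
and the read_split function are assumptions invented for the
illustration. The rule modelled here is that a split skips forward to
the first marker at or after its start offset, then keeps reading
whole groups of records until it meets a marker that begins at or past
its end offset.

```python
# Hypothetical marker bytes, chosen only for this illustration.
SYNC = b"\x00SYNC-MARKER\x00"


def read_split(data, start, length):
    """Return the records owned by the byte range [start, start + length)."""
    end = start + length
    if start == 0:
        pos = 0  # the beginning of the file counts as a sync point
    else:
        marker = data.find(SYNC, start)
        if marker == -1 or marker >= end:
            return []  # no sync point begins inside this split
        pos = marker + len(SYNC)
    records = []
    while True:
        nxt = data.find(SYNC, pos)
        chunk = data[pos:] if nxt == -1 else data[pos:nxt]
        records.extend(chunk.split(b"|"))
        if nxt == -1 or nxt >= end:
            break  # the next group belongs to a later split
        pos = nxt + len(SYNC)
    return records


groups = [[b"a1", b"a2"], [b"b1", b"b2", b"b3"], [b"c1"], [b"d1", b"d2"]]
data = SYNC.join(b"|".join(g) for g in groups)

# Cut the file at fixed byte offsets, ignoring record boundaries.
split_size = 10
splits = [(s, split_size) for s in range(0, len(data), split_size)]
per_split = [read_split(data, s, n) for s, n in splits]
all_records = [r for recs in per_split for r in recs]
print(per_split)
```

Even though the split offsets fall in the middle of records and
markers, every record is returned by exactly one split, and some
splits return nothing at all.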

On Tue, Sep 4, 2012 at 2:00 PM, Young-Geun Park
<younggeun.park@gmail.com> wrote:
> Hi, All
>
>
> I ran an MR program, WordCount:
>
> The input file is a sequence file using Snappy block compression.
>
> InputFormat is SequenceFileInputFormat.
>
>
> To check whether the SequenceFile.Writer.sync() method would affect an MR
> program, I tested two cases:
>
> In one case, writer.sync() was called; in the other case, it was not.
>
>
> The result was that there was no difference in MR running time between
> the two cases.
>
> The elapsed times of the two cases were about the same.
>
>
> Does the sync() method in SequenceFile.Writer NOT affect MR
> performance?
>
>
> Another question:
>
> According to the sources, a sequence file is split in getSplits() of
> FileInputFormat,
>
> which is the superclass of SequenceFileInputFormat.
>
> With the default configuration, the split size in getSplits() is set to
> the default block size (dfs.block.size).
>
> But I think that record boundaries should be considered when splitting
> a sequence file.
>
> I cannot understand how a sequence file can be split by the default block
> size without considering record boundaries.
>
> Am I missing something?
>
>
> Regards,
>
> Park



-- 
Harsh J
