Subject: questions about SequenceFile
From: Young-Geun Park
To: user@hadoop.apache.org
Date: Tue, 4 Sep 2012 17:30:18 +0900

Hi all,

I ran an MR program, WordCount:
The input file is a sequence file with Snappy block compression. The InputFormat is SequenceFileInputFormat.

To check whether the SequenceFile.Writer.sync() method affects an MR program, I compared two cases: in one case writer.sync() was called, and in the other it was not.

The result was that there was no difference in MR running time between the two cases. The elapsed times were about the same.

Does the sync() method in SequenceFile.Writer not affect MR performance?

Another question:

According to the source, a sequence file is split in getSplits() of FileInputFormat, which is the superclass of SequenceFileInputFormat. With the default configuration, the split size in getSplits() is set to the default block size (dfs.block.size).

But I think record boundaries should be considered when splitting a sequence file. I cannot understand how a sequence file can be split by the default block size without considering record boundaries.

Am I missing something?

Regards,
Park
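
To make the second question concrete, here is a minimal sketch of the split-size arithmetic as I read it. The formula max(minSize, min(maxSize, blockSize)) follows computeSplitSize() in the org.apache.hadoop.mapreduce FileInputFormat. The class name SplitSizeSketch, the helper splitOffsets(), the 64 MB block size, the 200 MB file length, and the min/max values are my own assumptions for illustration. The sketch is simplified and is not the actual getSplits() code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it does not depend on Hadoop.
class SplitSizeSketch {

    // Same arithmetic as FileInputFormat.computeSplitSize in the new API:
    // the block size, clamped between the minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified model of cutting a file into splits purely by byte offset.
    // Each entry is {start, length}. Record boundaries are never consulted,
    // which is exactly the behavior the question is about.
    static List<long[]> splitOffsets(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        while (start < fileLength) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] {start, length});
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;      // assumed block size, for illustration
        long minSize = 1L;             // assumed minimum split size
        long maxSize = Long.MAX_VALUE; // assumed: no maximum configured

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println("split size = " + splitSize + " bytes");

        // A hypothetical 200 MB input file.
        for (long[] s : splitOffsets(200 * mb, splitSize)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With these assumed values the split size equals the block size, so every split starts at a multiple of the block size regardless of where a record begins.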
