From: Aaron Kimball
Date: Sun, 19 Apr 2009 22:18:21 -0700
In-Reply-To: <49EBDCA3.2010006@comcast.net>
Subject: Re: Are SequenceFiles split? If so, how?
To: core-user@hadoop.apache.org

Yes, there can be more than one InputSplit per SequenceFile. The file will be
split more or less along 64 MB boundaries (the default HDFS block size). The
actual "edges" of the splits are adjusted to hit the next block of key-value
pairs, so they might be a few kilobytes off.

SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks()) as a
hint, not a set-in-stone metric. (The number of reduce tasks, though, is always
100% user-controlled.) If you need exact control over the number of map tasks,
you'll need to subclass SequenceFileInputFormat and modify this behavior.

That having been said, are you sure you actually need to control this value
precisely? Or is it enough to know how many splits were created?

- Aaron

On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman wrote:

> Suppose a SequenceFile (containing keys and values that are BytesWritable)
> is used as input. Will it be divided into InputSplits? If so, what's the
> criteria used for splitting?
>
> I'm interested in this because I need to control the number of map tasks
> used, which (if I understand it correctly), is equal to the number of
> InputSplits.
>
> thanks,
>
> bw
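
P.S. To make the "hint, not a set-in-stone metric" point concrete, here is a rough sketch of the split arithmetic as I recall it from the old-API FileInputFormat. It is a standalone model, not Hadoop code: the class name SplitSketch and the method countSplits are made up for illustration, it considers a single file, and it ignores the adjustment of split edges to key-value block boundaries. The minSize parameter stands in for the minimum split size (which I believe is driven by mapred.min.split.size); treat that mapping as an assumption.

```java
// Simplified, standalone model of split counting. Assumption: this mirrors
// the old-API FileInputFormat arithmetic from memory; it is not library code.
public class SplitSketch {
    // The last split is allowed to run up to 10% over the target size.
    public static final double SPLIT_SLOP = 1.1;

    // Target split size: capped by the block size, floored by minSize.
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits for one file. minSize must be at least 1.
    public static int countSplits(long fileSize, int numMapTasksHint,
                                  long minSize, long blockSize) {
        // The requested map-task count only sets a goal size per split.
        long goalSize = fileSize / Math.max(1, numMapTasksHint);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Asking for 1 map task on a 200 MB file with 64 MB blocks.
        System.out.println(countSplits(200 * mb, 1, 1, 64 * mb));
        // Asking for 10 map tasks on the same file.
        System.out.println(countSplits(200 * mb, 10, 1, 64 * mb));
        // Asking for 1 map task and raising the minimum split size to 200 MB.
        System.out.println(countSplits(200 * mb, 1, 200 * mb, 64 * mb));
    }
}
```

Under this model the three calls print 4, 10 and 1. Asking for more map tasks shrinks the splits, but asking for fewer has no effect once the goal size exceeds the block size, unless the minimum split size is raised as well.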