From: Harsh J
Date: Mon, 23 Sep 2013 17:00:44 +0530
Subject: Re: A couple of Questions on InputFormat
To: mapreduce-user

Hi,

(I'm assuming 1.0~ MR here)

On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis wrote:
> Classes implementing InputFormat implement
> public List<InputSplit> getSplits(JobContext job), which returns a List of
> InputSplits. For FileInputFormat the splits have a Path, start and end.
>
> 1) When is this method called and on which JVM on which machine and is it
> called only once?

Called only at a client, i.e. your "hadoop jar" JVM. Called only once.

> 2) Does the number of map tasks correspond to the number of splits
> returned by getSplits?

Yes, number of split objects == number of mappers.

> 3) InputFormat implements a method
> RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext
> context). Is this executed within the JVM of the Mapper on the slave
> machine and does the RecordReader run within that JVM?

RecordReaders are not created on the client side JVM. RecordReaders
are created on the Map task JVMs, and run inside them.

> 4) The default RecordReaders read a file from the start position to the end
> position emitting values in the order read. With such a reader, assume it is
> reading lines of text, is it reasonable to assume that the values the mapper
> received are in the same order they were found in a file? Would it, for
> example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)?

If you mean the LineRecordReader, each map() call simply receives one
line, i.e. the text up to the next \n. It is not language-aware, so it
does not understand the meaning of hyphens, etc. You can implement a
custom reader to do this, however. There should be no problems so long
as your logic makes sure that no data is read twice across multiple maps.

--
Harsh J
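
P.S. A minimal sketch of the line-joining logic that such a custom reader could apply, written as plain Java with no Hadoop dependency. The class and method names (HyphenJoiner, joinHyphenated) are hypothetical and are not part of any Hadoop API. It assumes the lines arrive in file order within a single split and that a trailing '-' always marks a word continued on the next line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Hadoop class: models what a custom
// RecordReader might do with the lines of ONE split, in file order.
final class HyphenJoiner {

    /** Merges each line that ends in '-' with the line that follows it. */
    static List<String> joinHyphenated(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean carrying = false; // true while a hyphenated fragment awaits its continuation

        for (String line : lines) {
            pending.append(line);
            if (line.endsWith("-")) {
                // Drop the trailing hyphen and wait for the next line.
                pending.setLength(pending.length() - 1);
                carrying = true;
            } else {
                records.add(pending.toString());
                pending.setLength(0);
                carrying = false;
            }
        }
        // Last line of the input ended in '-': emit the fragment as-is.
        // Deciding which task reads the continuation is NOT handled here.
        if (carrying) {
            records.add(pending.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a word that was hyphen-", "ated at the end", "of one line");
        // Prints: [a word that was hyphenated at the end, of one line]
        System.out.println(joinHyphenated(input));
    }
}
```

The sketch deliberately leaves out the split-boundary case. In a real reader you would need a rule for which map task consumes the continuation line, so that it is not read by two tasks.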