Return-Path:
X-Original-To: apmail-hadoop-hdfs-user-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-hdfs-user-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6774010BD9 for ; Wed, 26 Feb 2014 10:28:28 +0000 (UTC)
Received: (qmail 77894 invoked by uid 500); 26 Feb 2014 10:28:13 -0000
Delivered-To: apmail-hadoop-hdfs-user-archive@hadoop.apache.org
Received: (qmail 77196 invoked by uid 500); 26 Feb 2014 10:28:01 -0000
Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@hadoop.apache.org
Delivered-To: mailing list user@hadoop.apache.org
Received: (qmail 77185 invoked by uid 99); 26 Feb 2014 10:28:00 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 26 Feb 2014 10:28:00 +0000
X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (nike.apache.org: domain of drdwitte@gmail.com designates 209.85.217.173 as permitted sender)
Received: from [209.85.217.173] (HELO mail-lb0-f173.google.com) (209.85.217.173) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 26 Feb 2014 10:27:53 +0000
Received: by mail-lb0-f173.google.com with SMTP id p9so467019lbv.4 for ; Wed, 26 Feb 2014 02:27:33 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type;
 bh=19IGvB8JMpXi0IXZLyUvlPbigXsJ5Kljbxutl2IDz5Y=;
 b=e8TVEWIqmtde2HBcoGKM3lX3jdfTDDWdjxHGmsDX59ctXWo6TB+xxWyTon8hT45PRh
 CAC6mvrDdiHZu3vqNef1p7BqnGJThqgA7gb3I4wZzNUFpKA+3pq4+bWqVSB4QIHWbhpW
 TT9vfiO8X1RT1C23GSId4v2ll01J6Hzp/Jq0rUlI5V2N4OcE8T9pAEeVk0j8iUhx//fl
 pkVc2MGqln0BVZgXqBkZpcLZLJ95cp8g13kPA9U11PHKxFgtZw4u9pfTAOlM0O1hF+Jq
 sctK0iVG0xykVQllArSX+XTMv3C1gpCOCVS34awMCzDQ/J36Eij3cfGydvt6Vnaq74fv
 o/fQ==
MIME-Version: 1.0
X-Received: by 10.152.3.99 with SMTP id b3mr680014lab.61.1393410453442; Wed, 26 Feb 2014 02:27:33 -0800 (PST)
Received: by 10.112.144.101 with HTTP; Wed, 26 Feb 2014 02:27:33 -0800 (PST)
In-Reply-To:
References:
Date: Wed, 26 Feb 2014 11:27:33 +0100
Message-ID:
Subject: Re: Logic of isSplittable() of class FileInputFormat
From: Dieter De Witte
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=089e01419fe21a937604f34ca748
X-Virus-Checked: Checked by ClamAV on apache.org

--089e01419fe21a937604f34ca748
Content-Type: text/plain; charset=ISO-8859-1

No. Another example is a format where records span a variable number of
lines: if such a file were split, a record could be broken across two
splits, so you would override isSplittable() to always return false.

2014-02-26 11:22 GMT+01:00 Sugandha Naolekar :

> So basically, what I can deduce from this is that isSplittable() only
> applies to stream-compressed files. Right?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
> On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang wrote:
>
>> Hi Sugandha,
>>
>> Take a gz file as an example. It is not splittable because of the
>> compression algorithm it uses: a gzip stream can only be decompressed
>> from its beginning, so a task starting at the second block cannot
>> decode its data. There is also no guarantee that a record is located
>> within one block; if a record spans 2 blocks, your program will crash
>> since it cannot get the whole record.
>>
>> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> If a single file of size 129 MB is stored as two HDFS blocks because
>>> the maximum block size is 128 MB, and each block is read according to
>>> the InputFormat in use, what is the significance of the
>>> isSplittable() method?
>>>
>>> If it is set to false, will the entire block be considered a single
>>> input split? How will TextInputFormat react to it?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar

--089e01419fe21a937604f34ca748
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

No. Another example is a format where records span a variable number of lines: if such a file were split, a record could be broken across two splits, so you would override isSplittable() to always return false.
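
To make the multi-line case concrete, here is a minimal sketch in plain Java, with no Hadoop dependency. The record format (groups of lines separated by a blank line), the class name and the helper names are all hypothetical and chosen only for illustration. It is also a naive model: Hadoop's line reader does repair a single line that straddles a split boundary, but it knows nothing about records made of several lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical format for illustration: records are groups of lines
// separated by one blank line, so a record spans a variable number of lines.
class MultiLineRecordSketch {

    static final String FILE =
        "id=1\nname=a\n\n" +
        "id=2\nname=b\nnote=x\n\n" +
        "id=3\nname=c\n";

    // Parses a chunk of text into records, the way a reader that only sees
    // its own chunk would.
    static List<String> parseRecords(String chunk) {
        List<String> records = new ArrayList<>();
        for (String part : chunk.split("\n\n")) {
            String record = part.trim();
            if (!record.isEmpty()) {
                records.add(record);
            }
        }
        return records;
    }

    // Cuts the text at a fixed offset and parses both pieces independently,
    // mimicking two tasks that each process one split.
    static List<String> parseAsTwoSplits(String text, int offset) {
        List<String> records = new ArrayList<>(parseRecords(text.substring(0, offset)));
        records.addAll(parseRecords(text.substring(offset)));
        return records;
    }

    public static void main(String[] args) {
        System.out.println("unsplit: " + parseRecords(FILE));
        System.out.println("split:   " + parseAsTwoSplits(FILE, FILE.length() / 2));
    }
}
```

Because the cut falls inside the second record, the two pieces yield fragments of it instead of the record itself. Declaring the file non-splittable avoids this by handing the whole file to a single reader.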


2014-02-26 11:22 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
So basically, what I can deduce from this is that isSplittable() only applies to stream-compressed files. Right?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang <jezhang@gopivotal.com> wrote:
Hi Sugandha,

Take a gz file as an example. It is not splittable because of the compression algorithm it uses: a gzip stream can only be decompressed from its beginning, so a task starting at the second block cannot decode its data. There is also no guarantee that a record is located within one block; if a record spans 2 blocks, your program will crash since it cannot get the whole record.
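
The gz case can be tried out with the JDK alone. The sketch below uses only `java.util.zip`; the class and helper names are made up for this illustration and it does not touch Hadoop. It compresses some sample lines and then tries to decompress once from the start and once from the middle of the compressed bytes, which is roughly the position a second split would start from.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipOffsetSketch {

    // Compresses the text into a single gzip stream held in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Tries to decompress starting at the given byte offset.
    // Returns the decoded text, or null if the bytes cannot be decoded.
    static String gunzipFrom(byte[] compressed, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed, offset, compressed.length - offset))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Includes ZipException, raised when the gzip header is missing.
            return null;
        }
    }

    // One short record per line, enough to get a few kilobytes of output.
    static String sampleText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = sampleText();
        byte[] compressed = gzip(text);
        System.out.println("decoded from start:  " + text.equals(gunzipFrom(compressed, 0)));
        System.out.println("decoded from middle: " + (gunzipFrom(compressed, compressed.length / 2) != null));
    }
}
```

Reading from offset 0 should round-trip the text, while the read from the middle is expected to fail because the bytes there do not begin with a gzip header. That is why a gz file has to be processed as one split.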



On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <sugandha.n87@gmail.com> wrote:
Hello,

If a single file of size 129 MB is stored as two HDFS blocks because the maximum block size is 128 MB, and each block is read according to the InputFormat in use, what is the significance of the isSplittable() method?

If it is set to false, will the entire block be considered a single input split? How will TextInputFormat react to it?
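
As a rough model of what the flag changes, the sketch below plans splits for one file in plain Java. It is not Hadoop's code: `planSplits` and the class name are hypothetical, and the real `FileInputFormat.getSplits()` also considers the configured minimum and maximum split sizes and, as far as I understand it, lets the last split run slightly over, so a 129 MB file may in practice still end up as a single split. Note too that the method in Hadoop's `FileInputFormat` is spelled `isSplitable`, with one "t".

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (an assumption, not Hadoop's implementation) of how a
// splittable flag affects the number of input splits for one file.
class SplitPlanSketch {

    static final long MB = 1024L * 1024L;

    // Returns the byte length of each planned split for a single file.
    static List<Long> planSplits(long fileLength, long splitSize, boolean splittable) {
        List<Long> lengths = new ArrayList<>();
        if (fileLength <= 0) {
            return lengths;
        }
        if (!splittable) {
            // Non-splittable: the whole file is one split, however many blocks it spans.
            lengths.add(fileLength);
            return lengths;
        }
        long remaining = fileLength;
        while (remaining > splitSize) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        lengths.add(remaining);
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println("splittable:     " + planSplits(129 * MB, 128 * MB, true));
        System.out.println("not splittable: " + planSplits(129 * MB, 128 * MB, false));
    }
}
```

In this model the non-splittable case produces a single split covering the whole file, not one split per block, so one map task reads both blocks and fetches the second one over the network if it is not local.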


--
Thanks & Regards,
Sugandha Naolekar






--089e01419fe21a937604f34ca748--