From: Owen O'Malley
To: core-user@hadoop.apache.org
Subject: Re: Implementing own InputFormat and RecordReader
Date: Mon, 15 Sep 2008 09:43:26 -0700

On Sep 15, 2008, at 6:13 AM, Juho Mäkinen wrote:

> 1) The FileInputFormat.getSplits() returns an InputSplit[] array. If my
> input file is 128MB and my HDFS block size is 64MB, will it return one
> InputSplit or two InputSplits?
Your InputFormat needs to define:

protected boolean isSplitable(FileSystem fs, Path filename) {
  return false;
}

which tells FileInputFormat.getSplits() not to split the files. You will
end up with a single split for each file.

> 2) If my file is split into two or more filesystem blocks, how will
> hadoop handle the reading of those blocks? As the file must be read in
> sequence, will hadoop first copy every block to a machine (if the
> blocks aren't already there) and then start the mapper on that
> machine? Do I need to handle the opening and reading of multiple blocks,
> or will hadoop provide me a simple stream interface which I can use to
> read the entire file without worrying whether the file is larger than
> the HDFS block size?

HDFS transparently handles the data motion for you. You can just use
FileSystem.open(path) and HDFS will pull the file's blocks from the
closest locations. It doesn't actually copy the blocks to your local
disk; it just streams the data to the application. Basically, you don't
need to worry about it.

There are two downsides to unsplittable files. The first is that if
they are large, the map times can be very long. The second is that the
map/reduce scheduler tries to place tasks close to their data, which it
can't do very well if the data spans several blocks. Of course, if the
data isn't splittable, you don't have a choice.

-- Owen
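
Putting the two answers together, here is a minimal sketch of such an
unsplittable InputFormat and its RecordReader, written against the old
org.apache.hadoop.mapred API that was current in 2008. The names
WholeFileInputFormat and WholeFileRecordReader are illustrative, not
part of Hadoop, and the sketch assumes each file is small enough to be
handed to the map as a single in-memory BytesWritable record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example class, not shipped with Hadoop.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  // Tell FileInputFormat.getSplits() never to break a file into
  // block-sized splits: one split per file.
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  // Emits exactly one record per file: the file's entire contents.
  static class WholeFileRecordReader
      implements RecordReader<NullWritable, BytesWritable> {

    private final FileSplit split;
    private final Configuration conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, Configuration conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value)
        throws IOException {
      if (processed) {
        return false;
      }
      // Assumes the file fits in memory (length < 2 GB).
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        // HDFS streams the bytes from wherever the blocks live;
        // the application just sees an ordinary input stream.
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }
  }
}

Because isSplitable() returns false, getSplits() hands each file to
exactly one map task, and the reader simply opens the file with
FileSystem.open() and lets HDFS fetch whichever blocks it needs.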