hadoop-mapreduce-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject RE: How can a YarnTask read/write local-host HDFS blocks?
Date Tue, 02 Jul 2013 18:13:07 GMT
Blah blah,
One point you might have missed: multiple tasks cannot all write to the same HDFS file at the
same time.  So you can't just split an output file into sections and say "task1 writes block1",
etc.  Typically each task writes its own separate file, and these file-parts are read or merged
downstream.
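The per-task output pattern above can be sketched in plain Java. This is only an illustration, using a local temp directory as a stand-in for HDFS; the task count and the `part-NNNNN` naming (the usual Hadoop convention) are assumptions, not code from the thread.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of the "one output file per task" pattern: each task writes its own
// part file, and a downstream step merges them. A local temp directory stands
// in for HDFS here; the "part-00000" names follow the usual Hadoop convention.
public class PartFileMerge {

    // Simulate numTasks tasks, each writing its own part file (never sharing).
    static Path writeParts(int numTasks) throws IOException {
        Path outDir = Files.createTempDirectory("job-output");
        for (int task = 0; task < numTasks; task++) {
            Path part = outDir.resolve(String.format("part-%05d", task));
            Files.writeString(part, "records from task " + task + "\n");
        }
        return outDir;
    }

    // Merge the parts in filename order, as a reader of the job output would.
    static String mergeParts(Path outDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(outDir, "part-*")) {
            ds.forEach(parts::add);
        }
        Collections.sort(parts);
        StringBuilder merged = new StringBuilder();
        for (Path p : parts) {
            merged.append(Files.readString(p));
        }
        return merged.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(mergeParts(writeParts(3)));
    }
}
```

In a real job the merge step is often skipped entirely: downstream consumers just read all the `part-*` files as one logical dataset.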

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Saturday, June 22, 2013 5:33 AM
To: <user@hadoop.apache.org>
Subject: Re: How can a YarnTask read/write local-host HDFS blocks?


On Sat, Jun 22, 2013 at 4:21 PM, blah blah <tmp5330@gmail.com> wrote:
> Hi all
> Disclaimer
> I am creating a prototype Application Master. I am using old Yarn 
> development version. Revision 1437315, from 2013-01-23 (SNAPSHOT 
> 3.0.0). I can not update to current trunk version, as prototype 
> deadline is soon, and I don't have time to include Yarn API changes.
> My cluster setup is as follows:
> - each computational node acts as NodeManager and as a DataNode
> - dedicated single node for the ResourceManager and NameNode
> I have scheduled Containers/Tasks to the hosts which hold input data 
> HDFS blocks to achieve data locality (new 
> AMRMClient.ContainerRequest(capability,
> blocksHosts, racks, pri, numContainers)). I know that the Task 
> schedule is not guaranteed (but lets assume Tasks were scheduled 
> directly to hosts with input HDFS blocks). I have 3 questions 
> regarding reading/writing data from HDFS.
> 1. How can a Container/Task read local HDFS block?
> Since Container/Task was scheduled on the same computational node as 
> its input HDFS block, how can I read the local block? Should I use 
> LocalFileSystem, since HDFS block is stored locally? Any code snippet 
> or source code reference will be greatly appreciated.

The HDFS client does local reads automatically if there is a local DN on the host where it is
running (and that DN has the requested block). A developer needn't concern themselves with
explicitly trying to read local data - the framework does it automatically.

> 2. Multiple Containers on same Host, how to differ which local block 
> should be read by which Container/Task?
> In case there are multiple Containers/Tasks scheduled to the same 
> Host, and also different input HDFS blocks are stored on the same 
> Host. How can I ensure that Container/Task will read "its" HDFS local 
> block. For example INPUT consists of 10 blocks, Job uses 5 nodes, and 
> for each node 2 containers were scheduled, also each node holds 2 
> distinct HDFS blocks. How can I read Block_A in Container_2_Host_A and Block_B in Container_3_Host_A.
> Again any code snippet or source code reference will be greatly appreciated.

You basically have to assign a file region (offset + length) to each container ID you launch.
Each container then reads only its assigned region. You can pass this read info to the
containers via CLI options, a serialized file, etc.
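The assignment Harsh describes can be sketched as follows. This is a minimal illustration in plain Java, not code from the thread: the container IDs, the even contiguous partitioning, and the `--offset/--length` CLI flags are all assumptions.

```java
import java.util.*;

// Sketch of assigning each launched container a fixed (offset, length)
// region of the input file, to be passed to it e.g. as CLI arguments.
// Container IDs, flag names, and the file length are hypothetical.
public class SplitAssignment {

    // One split per container: bytes [offset, offset + length) of the file.
    record Split(long offset, long length) {}

    // Divide fileLength into one contiguous split per container, in order.
    static Map<String, Split> assignSplits(List<String> containerIds, long fileLength) {
        long chunk = fileLength / containerIds.size();
        Map<String, Split> assignments = new LinkedHashMap<>();
        long offset = 0;
        for (int i = 0; i < containerIds.size(); i++) {
            // The last container absorbs any remainder bytes.
            long len = (i == containerIds.size() - 1) ? fileLength - offset : chunk;
            assignments.put(containerIds.get(i), new Split(offset, len));
            offset += len;
        }
        return assignments;
    }

    public static void main(String[] args) {
        Map<String, Split> a =
            assignSplits(List.of("container_01", "container_02", "container_03"), 400L);
        // Each container would then be launched with its own region flags.
        a.forEach((id, s) ->
            System.out.println(id + " --offset " + s.offset() + " --length " + s.length()));
    }
}
```

In practice you would align these splits to the HDFS block boundaries you already used for the `ContainerRequest`, so that each container's region is the block that is local to its host.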

> 3. Write HDFS block to local node (not local file system).
> How can I write read-processed HDFS blocks back to HDFS, but store it 
> on the same local host. As far as I know (if I am wrong please correct 
> me), whenever Task writes some data to HDFS, HDFS tries to store it on 
> the same host, then rack, then as close as possible (assuming replication factor 3).
> Is this process automated, and simple hdfs.write() will do the trick? 
> You know that any code snippet or source code reference will be 
> greatly appreciated.

This process is automatic in the same way a local read is automatic.
You needn't write special code for this.

> Thank you for your help in advance.
> regards
> tmp

Harsh J
