hadoop-common-user mailing list archives

From Arjun Bakshi <baksh...@mail.uc.edu>
Subject Re: Building custom block placement policy. What is srcPath?
Date Thu, 24 Jul 2014 21:25:06 GMT

Thanks for the reply. It cleared up a few things.

I hadn't thought about under-replication scenarios, but I'll give them 
some thought now. That case should be easier since, as you mentioned, by 
that time the NameNode knows all the blocks that came from the same file 
as the under-replicated block.

For the most part, I was thinking of when a new file is being placed on 
the cluster; I believe this is what you called in-progress files. Say a 
new 1GB file needs to be placed onto the cluster. I want the placement 
logic to take the file's 1GB size into account while placing all of its 
blocks onto nodes in the cluster.
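To make the scale concrete, here is a toy sketch (my own arithmetic, not 
Hadoop code) of how many placement decisions a 1GB file would trigger, 
assuming a configurable block size such as a hypothetical 128MB 
dfs.blocksize; each request would carry the file path but not the file's 
final length:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Hadoop's actual write path: a file is cut into
// fixed-size blocks, and each block triggers one placement request that
// carries the HDFS file path (what chooseTarget receives as srcPath)
// but not the file's final length.
public class PlacementCallSketch {
    static List<String> placementRequests(String srcPath, long fileSize, long blockSize) {
        List<String> requests = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // One "allocate a new block for this file" call per block
            requests.add(srcPath + " block#" + (offset / blockSize));
        }
        return requests;
    }

    public static void main(String[] args) {
        long oneGB = 1024L << 20;    // 1 GB
        long blockSize = 128L << 20; // hypothetical 128 MB block size
        List<String> reqs = placementRequests("/user/ab/input.dat", oneGB, blockSize);
        System.out.println(reqs.size()); // 8 placement calls for a 1 GB file
    }
}
```

So under these assumptions the policy is consulted eight separate times 
for one logical file, which is why any per-file context has to travel 
with each call (e.g. via srcPath) rather than arriving all at once.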

I'm not clear on where the file is broken down into blocks/chunks, in 
terms of which class, which file system (local or HDFS), or where in the 
process flow. Knowing that will help me come up with a solution. What is 
the last place, in terms of a function or point in the process, where I 
can find the name of the original file that is being placed on the system?

I'm reading the NameNode and FSNamesystem code to see if I can do what I 
want from there. Any suggestions would be appreciated.

Thank you,


On 07/24/2014 02:12 PM, Harsh J wrote:
> Hello,
> (Inline)
> On Thu, Jul 24, 2014 at 11:11 PM, Arjun Bakshi <bakshian@mail.uc.edu> wrote:
>> Hi,
>> I want to write a block placement policy that takes the size of the file
>> being placed into account, something like what is done in the CoHadoop or
>> BEEMR papers. I have the following questions:
>> 1- What is srcPath in chooseTarget? Is it the path to the original
>> un-chunked file, a path to a single block, or something else? I added
>> some code to BlockPlacementPolicyDefault to print out the value of
>> srcPath, but the results look odd.
> The arguments are documented in the interface javadoc:
> https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java#L61
> The srcPath is the file path of the file on HDFS for which the block
> placement targets are being requested.
>> 2- Will a simple new File(srcPath) do?
> Please rephrase? The srcPath is not a local file, if that's what you meant.
>> 3- I've spent time looking at the Hadoop source code. I can't find a way to
>> go from srcPath in chooseTarget to a file size. Every function I think could
>> do it, in FSNamesystem, FSDirectory, etc., is either non-public or cannot be
>> called from inside the blockmanagement package or the block placement class.
> Block placement is invoked, within the context of a new file
> creation, when a new block is requested. At that point the file is
> not complete, so there is no way to determine its actual length,
> only the requested block size. I'm not certain that
> BlockPlacementPolicy is what will solve your goal.
>> How do I go from srcPath in blockplacement class to size of the file being
>> placed?
> Are you targeting in-progress files or completed files? The latter
> form of files would result in placement policy calls iff there's
> under-replication/losses/etc. affecting block replicas of the original
> set. Only for such operations would you have a possibility of
> determining the actual full length of the file (as explained above).
>> Thank you,
>> AB
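
Following your point that the full length is only known for completed 
files: if I restrict the size-aware behaviour to the re-replication 
case, the decision I have in mind is just a threshold rule. A toy 
sketch (my own illustration, not taken from the CoHadoop or BEEMR 
papers, with a hypothetical cutoff):

```java
// Toy decision rule, my own illustration rather than anything from the
// CoHadoop/BEEMR papers: once the full file length is known (which
// Harsh notes is only the case for completed files), pick a placement
// strategy from it.
public class SizeAwareChoice {
    enum Strategy { COLOCATE, DEFAULT }

    // Hypothetical threshold-based choice: keep blocks of small files
    // together, fall back to the default spread for large files.
    static Strategy strategyFor(long fileLength, long thresholdBytes) {
        return fileLength <= thresholdBytes ? Strategy.COLOCATE : Strategy.DEFAULT;
    }

    public static void main(String[] args) {
        long threshold = 256L << 20; // hypothetical 256 MB cutoff
        System.out.println(strategyFor(64L << 20, threshold));   // COLOCATE
        System.out.println(strategyFor(1024L << 20, threshold)); // DEFAULT
    }
}
```

The open question would remain how to get fileLength from inside the 
block placement code, per question 3 above.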
