hadoop-common-dev mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: large block size problem
Date Mon, 16 Mar 2009 17:55:33 GMT

On Mar 16, 2009, at 11:03 AM, Owen O'Malley wrote:

> On Mar 16, 2009, at 4:29 AM, Steve Loughran wrote:
>
>> I spoke with someone from the local university on their High Energy  
>> Physics problems last week -their single event files are about 2GB,  
>> so that's the only sensible block size to use when scheduling work.  
>> He'll be at ApacheCon next week, to make his use cases known.
>
> I don't follow. Not all files need to be 1 block long. If your files  
> are 2GB, 1GB blocks should be fine and I've personally tested those  
> when I've wanted to have longer maps. (The block size of a dataset  
> is the natural size of the input for each map.)
>

Hm ... I work on the same project and I'm not sure I agree with this  
statement.

The problem is that the files contain independent event data from a
particle detector (about 1-2 MB per event).  However, the file
organization is such that it's not possible to split the file at this
point (not to mention that it takes quite a bit of overhead to start
up the processing for each file).
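
For anyone running into the same thing: one way to keep MapReduce from
trying to split such files is to mark the input format non-splittable
and let each map open its one file with the experiment's own I/O
library.  A rough sketch against the old mapred API (the class name
and the path-as-record trick are just for illustration, not anything
we actually run):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical input format: each event file becomes exactly one split,
// and each map task just receives the file's path and opens it itself.
public class WholeEventFileInputFormat
    extends FileInputFormat<Text, NullWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    // The event files can't be cut at arbitrary byte offsets.
    return false;
  }

  @Override
  public RecordReader<Text, NullWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final FileSplit fileSplit = (FileSplit) split;
    return new RecordReader<Text, NullWritable>() {
      private boolean done = false;

      public boolean next(Text key, NullWritable value) {
        if (done) {
          return false;
        }
        // Hand the map task the file path as its single "record".
        key.set(fileSplit.getPath().toString());
        done = true;
        return true;
      }

      public Text createKey() { return new Text(); }
      public NullWritable createValue() { return NullWritable.get(); }
      public long getPos() { return done ? fileSplit.getLength() : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() { }
    };
  }
}

Each map then does the heavy lifting itself, and you pay the per-file
startup overhead once per file rather than once per attempted split.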

Turning the block size way up would mean that any job could keep its
data access completely node-local.  OTOH, this probably defeats one of
the best advantages of using HDFS: block decomposition mostly solves
the "hot spot" issue.  Ever seen what happens to a file system when a
user submits 1000 jobs to analyze a single 2GB file?  With block
decomposition the reads get spread over 20 or so servers; with only
one block per file, they all land on the 1-3 servers holding that
block's replicas.  Big difference.
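
To put rough numbers on it (taking the stock 64MB block size and 3x
replication as an illustration, not our actual settings):

  2GB file / 64MB blocks = 32 blocks
  32 blocks x 3 replicas = 96 block replicas for the namenode to
                           scatter across the cluster

  2GB file as one block  =  1 block
  1 block x 3 replicas   =  3 replicas, so all 1000 readers pile onto
                            the same 3 datanodes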

Brian 
