hadoop-common-user mailing list archives

From: Karl Anderson <...@monkey.org>
Subject: Re: too many open files? Isn't 4K enough???
Date: Wed, 12 Nov 2008 23:47:28 GMT

On 5-Nov-08, at 4:08 PM, Yuri Pradkin wrote:

> I suspect your total open FDs = (#mappers) x (FDs/map)
>
> In my case the second factor was ~5K; so if I ran 8 mappers the total
> might have been as high as 40K!  This is totally insane.
>
> Perhaps playing with GC modes might help...
>
>> In general, I've had to do a lot of fine-tuning of my job parameters
>> to balance memory, file handles, and task timeouts.  I'm finding that
>> a setup that works with one input set breaks when I try it on an
>> input set which is twice the size.  My productivity is not high while
>> I'm figuring this out, and I wonder why I don't hear about this more.
>> Perhaps this is a streaming issue, and streaming isn't being used
>> very much?
>
> I doubt in my case this is specific to streaming, although streaming
> might exacerbate the problem by opening pipes, etc.  In my case the
> vast majority of open files were spills during sorting/shuffling,
> which is not restricted to streaming.
>
> This is a scalability issue and I'd really like to hear from developers.
>
>  -Yuri
>
> P.S. It looks like we need to file a jira on this one...
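
On the spill point: if most of the descriptors really are spill files
from the sort/shuffle, the knobs that govern them are io.sort.mb (how
much map output is buffered before spilling, so a bigger buffer means
fewer spill files) and io.sort.factor (how many spill segments get
merged, and therefore held open, at once).  Purely as an illustration,
with values picked at random:

   -jobconf io.sort.mb=200 -jobconf io.sort.factor=10

I haven't verified whether changing these moves the FD counts here.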

Are you able to create a reproducible setup for this?  I haven't been able to.

I'm only able to cause this to happen after a few runs of my own jobs
first, which do various things and involve several Python libraries
and downloading from S3.  After I've done this, it looks like any
streaming job will have tasks die, but if I don't run my jobs first, I
don't have a problem.  I also can't figure out what's consuming the
open files; I'm not seeing the large lsof numbers that you were.
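
One thing that might help pin down your (#mappers) x (FDs/map) estimate,
and my missing lsof numbers, is counting descriptors per task child JVM
instead of taking one node-wide lsof total.  A rough sketch, assuming the
0.18 task JVMs show up as org.apache.hadoop.mapred.Child processes and
that pgrep and /proc are available on the node:

   # sum open FDs across the task child JVMs on one node
   # (the process-name pattern is an assumption about 0.18)
   total=0
   for pid in $(pgrep -f org.apache.hadoop.mapred.Child); do
       n=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
       echo "task pid $pid: $n open fds"
       total=$((total + n))
   done
   echo "total across task JVMs: $total"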

Obviously, the jobs I'm running beforehand are causing problems for
later jobs, but I haven't isolated what it is yet.


My cluster:
- hadoop 0.18.1
- cluster of 64 EC2 xlarge nodes, created with the hadoop-ec2 tools,
   edited to increase the max open files for root to 131072 (see the
   sketch after this list)
- 8 max mappers or reducers per node
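
For reference, that edit amounts to raising the nofile limit for root;
on my images it looks roughly like the following, though the hadoop-ec2
scripts may apply it differently (e.g. via ulimit in an init script):

   # /etc/security/limits.conf (illustrative; exact mechanism may differ)
   root  soft  nofile  131072
   root  hard  nofile  131072

If the limit goes through PAM like this, the daemons have to be
restarted in a fresh session before it takes effect.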

After I had some of my jobs die, I tested the cluster with this streaming job:

   hadoop jar /usr/local/hadoop-0.18.1/contrib/streaming/hadoop-0.18.1-streaming.jar \
     -mapper cat -reducer cat -input clusters_0 -output foo \
     -jobconf mapred.output.compress=false \
     -jobconf mapred.map.tasks=256 -jobconf mapred.reduce.tasks=256

I ran this manually a few times, not changing anything other than
deleting the output directory, and never ran more than one job at once.
While I ran it, I checked the number of open files on two of the nodes
with:

   while true; do lsof | wc -l; sleep 1; done

Tasks died on each job due to "file not found" or "too many open files"
errors.  Each job succeeded eventually.
The job never got more than 120 or so mappers or reducers at once
(because the scheduler couldn't catch up; a real job on this cluster
setup was able to get to 8 tasks per node).
1st run: 31 mappers died, 11 reducers died.
2nd run: 16/12.
3rd run: 14/6.
4th run: 14/6.

I never saw more than 1600 or so open files on the two nodes I was
checking.  Tasks were dying on these nodes during this time.
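
A finer-grained check next time might be to break down one task JVM's
descriptors by the NAME column, to see whether spill files, pipes,
sockets, or jars dominate; something along these lines (the pid is just
a placeholder):

   # group one task JVM's open descriptors by the last (NAME) field;
   # names containing spaces get truncated, but it's enough for a rough count
   pid=12345   # placeholder: pid of one busy task child JVM
   lsof -p $pid | awk 'NR > 1 {print $NF}' | sort | uniq -c | sort -rn | head -20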

The input directory (clusters_0) contained one 797270-byte, 4096-line
ASCII file.

I terminated and re-created my cluster.  This time I just uploaded the
input file and ran the test jobs; I didn't run my own jobs first.
I wasn't able to cause any errors.




Karl Anderson
kra@monkey.org
http://monkey.org/~kra



