hadoop-common-dev mailing list archives

From Stephane Bailliez <sbaill...@gmail.com>
Subject Re: Running tasks in the TaskTracker VM
Date Tue, 20 Mar 2007 13:58:59 GMT
Torsten Curdt wrote:
>> Being a complete idiot for distributed computing, I would say it is 
>> easy to explode a JVM when doing such distributed jobs, (should it be 
>> for OOM or anything).
> Then restrict what people can do - at least Google went that route.

I don't know the specifics of what Google did :)

If you want to do that in Java, restricting memory usage, CPU usage 
and file-descriptor access within each in-VM task instance, that's a 
considerable amount of work. It likely implies writing a specific agent 
for the VM (or rather an agent for a specific VM, since it's pretty 
unlikely you will get the same results across VMs), and it assumes 
accounting can really be done at the classloader level for each task 
(which looks insanely complex to me once you consider allocations done 
at the parent classloader level, etc.).

At least by forking a VM you can get reasonably bounded control over 
resource usage (or at least memory) without bringing everything down, 
since a VM is already bounded to some degree.
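To make the forking idea concrete, here is a minimal sketch of launching a child JVM with a hard heap cap via ProcessBuilder. This is purely illustrative, not TaskTracker code; a real launcher would pass the task's main class and classpath instead of "-version", which is used here only so the sketch is self-contained and runnable.

```java
import java.io.File;
import java.io.IOException;

public class ForkedTaskLauncher {
    // Launch a child JVM with a bounded heap. An OOM or crash stays in
    // the child process and surfaces as a non-zero exit code, instead of
    // taking the parent VM down with it.
    static int launch() throws IOException, InterruptedException {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        ProcessBuilder pb = new ProcessBuilder(
                javaBin,
                "-Xmx64m",        // hard cap on the child's heap
                "-version");      // stand-in for the real task entry point
        pb.redirectErrorStream(true);
        Process child = pb.start();
        child.getInputStream().readAllBytes(); // drain output so the child can exit
        return child.waitFor();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("child exit code: " + launch());
    }
}
```

The point of the exercise is that the operating system enforces the boundary, so the parent only has to inspect the exit code rather than guard against arbitrary in-VM failures.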

>> Failing jobs are not exactly uncommon and running things in a 
>> sandboxed environment with less risk for the tracker seems like a 
>> perfectly reasonable choice. So yeah, vm pooling certainly makes 
>> perfect sense for it
> I am still not convinced - sorry
> It's a bit like you would like to run JSPs in a separate JVM because 
> they might take down the servlet container.

that comparison is a bit too fine-grained. I think it's more like the 
choice of running n different webapps within the same VM or not: if one 
webapp is a resource hog, separating it out would not harm the n-1 other 
applications, and you would either create another server instance or 
move it to another node.

I know of environments with large numbers of nodes (not related to 
Hadoop) where they also reboot a set of nodes daily to ensure that all 
machines are really in working condition. It's usually when a machine 
reboots due to a failure that someone has to rush to it, because some 
service was never registered to start at boot or things like that, so 
this periodic check gives people a better idea of their response time to 
failure. That depends on operational procedures, for sure.

I don't think this should be designed in the spirit that everything 
works perfectly in a perfect world, because we know it's not like that. 
There will be a compromise between safety and performance, and having 
something reasonably tolerant of failure is also a performance advantage.

Doing something as simple as calling deleteOnExit in a task is enough, 
on some VMs, to leak a few KB each time and keep it there until the VM 
dies (fixed in 1.5.0_10, if I remember correctly). Tracking down things 
like that is likely to take a severe amount of time, considering it's an 
internal leak and will not show up in your favorite Java profiler either.

Bottom line: even if you're 100% sure of your own code, which is quite 
unlikely (at least in my case), you don't know the third-party code. So 
without being totally paranoid, this is something that cannot be ignored.

-- stephane
