hadoop-common-user mailing list archives

From Amar Kamat <ama...@yahoo-inc.com>
Subject Re: Bugs in 0.16.0?
Date Mon, 03 Mar 2008 16:46:30 GMT
Hadoop is not meant for real-time applications. It's more or less designed
for long-running applications like crawlers/indexers.
On Mon, 3 Mar 2008, Spiros Papadimitriou wrote:

> Hi
> I'd be interested to know if you've tried to use Hadoop for a large number
> of short jobs.  Perhaps I am missing something, but I've found that the
> hardcoded Thread.sleep() calls, esp. those for 5 seconds in
> mapred.ReduceTask (primarily) and mapred.JobClient, cause more of a problem
> than the 0.3 sec or so that it takes to fire up a JVM.
> Agreed that for long running jobs that is not a concern, but *if* speeding
> things up for shorter running jobs (say < 1 min) is a goal, then
> JVM reuse would seem to be a lower priority?  Would doing something about
> those sleep()s seem worthwhile?
> Thanks,
> Spiros
> On Sat, Mar 1, 2008 at 4:33 PM, Owen O'Malley <oom@yahoo-inc.com> wrote:
>> On Mar 1, 2008, at 12:05 PM, Amar Kamat wrote:
>>>> 3) Lastly, it would seem beneficial for jobs that have significant
>>>> startup overhead and memory requirements to not be run in separate
>>>> JVMs for each task.  Along these lines, it looks like someone
>>>> submitted a patch for JVM-reuse a while back, but it wasn't
>>>> committed? https://issues.apache.org/jira/browse/HADOOP-249
>> Most of the ideas in the patch for 249 were committed as other
>> patches, but that bug has been left open precisely because the idea
>> still has merit. The patch was never stable enough to commit and now
>> is hopelessly out of date. There are lots of little issues that would
>> need to be addressed for this to happen.
>>>> Probably a question for the dev mailing list, but if I wanted to
>>>> modify hadoop to allow threading tasks, rather than running
>>>> independent JVMs, is there any reason someone hasn't done this
>>>> yet?  Or am I overlooking something?
>>> This is done to keep user code separate from the framework code.
>> Precisely. We don't want to go through the security manager in the
>> servers, so it is far easier to keep user code out of the servers.
>>> So if the user code develops a fault the framework and rest of the
>>> jobs function normally. Most of the jobs have a longer run time and
>>> hence the startup time is never a concern.
>> As long as the tasks belong to the same job (and therefore user),
>> sharing a jvm should be fine. One concern is that currently each task
>> gets its own working directory. Since Java can't change working
>> directory in a running process, it would have to clean up the working
>> directory. That will interact badly with debugging settings that let
>> you keep the task files. However, as we speed things up, it will
>> become more important. Already we are starting to see sort maps that
>> finish in 17 seconds, which means the 1 second of jvm startup is a
>> 5% overhead...
>> -- Owen
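
For scale, the figures quoted in this thread can be compared with a back-of-the-envelope calculation. The following is a minimal Java sketch, not code from Hadoop: the class and method names are hypothetical, and it assumes a very simple model in which a fixed cost (JVM startup or a single hardcoded sleep) is counted once and expressed as a fraction of the total run time.

```java
import java.util.Locale;

public class StartupOverhead {

    /**
     * Fraction of a task's total run time taken by one fixed cost.
     * Both arguments are in seconds. Illustrative model only.
     */
    static double overheadFraction(double fixedCostSeconds, double totalSeconds) {
        if (totalSeconds <= 0) {
            throw new IllegalArgumentException("totalSeconds must be positive");
        }
        return fixedCostSeconds / totalSeconds;
    }

    /** Formats a fraction as a percentage with one decimal place. */
    static String percent(double fraction) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * fraction);
    }

    public static void main(String[] args) {
        // Numbers quoted in the thread: 1 s JVM startup against a 17 s sort map.
        System.out.println("1 s startup in a 17 s map:   " + percent(overheadFraction(1.0, 17.0)));
        // 0.3 s JVM startup and one 5 s sleep, each against a 60 s job.
        System.out.println("0.3 s startup in a 60 s job: " + percent(overheadFraction(0.3, 60.0)));
        System.out.println("one 5 s sleep in a 60 s job: " + percent(overheadFraction(5.0, 60.0)));
    }
}
```

Under these assumptions, a single 5 second sleep costs a one-minute job far more than a 0.3 second JVM start does, and the 1 second startup against a 17 second map comes out slightly under 6%, in line with the rough 5% quoted above. How many times a real job actually hits each sleep is not stated in the thread, so the model counts it only once.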
