hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Creating splits/tasks at the client
Date Thu, 28 Sep 2006 18:13:44 GMT
Benjamin Reed wrote:
> One of the things that bothers me about the JobTracker is that it is
> running user code when it creates the FileSplits. In the long term this
> puts the JobTracker JVM at risk due to errors in the user code.

JVMs are supposed to be able to do this kind of stuff securely.  Still,
we don't currently leverage this much, and the JVM's security is
limited, so it is a valid concern.

Note that, while we do avoid running user code in the tasktrackers
themselves (mapping, sorting and reducing are done in a subprocess),
those subprocesses still run under a system user id.  So security
issues are to some degree unavoidable.

But in terms of inadvertent denial of service, running user code in the
job tracker, a single point of failure, does make the system more fragile.

> The JobTracker uses the InputFormat to create a set of tasks that it
> then schedules. The task creation does not need to happen at the
> JobTracker. If we allowed the clients to create the set of tasks, the
> JobTracker would not need to load and run any user generated code. It
> would also remove some of the processing load from the JobTracker. On
> the downside it does greatly increase the amount of information sent to
> the JobTracker when a job is submitted.

Right, so JobSubmissionProtocol.submitJob(String jobFile) could be
altered to be submitJob(String jobFile, Split[]).  The RPC system can
handle reasonably large values like this, so I don't think that would be 
a problem.  But the memory impact on the JobTracker could become 
significant, since the splits for queued jobs would now be around.  This 
could be mitigated by writing the splits to a temporary file.
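
As a rough sketch of the temporary-file idea (not actual Hadoop code: the
Split class, its fields, and the SplitSpill helper are all made up for
illustration), the JobTracker could serialize the submitted splits on
receipt and read them back only when the job is about to be scheduled:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: spills client-computed splits to a temp file so that
// queued jobs do not hold their splits in JobTracker memory.
class SplitSpill {

    // Hypothetical stand-in for a file split: a path plus a byte range.
    public static final class Split {
        public final String path;
        public final long start;
        public final long length;

        public Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Write a count followed by each split's fields; return the temp file.
    public static File writeSplits(List<Split> splits) throws IOException {
        File file = File.createTempFile("job-splits", ".bin");
        file.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(splits.size());
            for (Split s : splits) {
                out.writeUTF(s.path);
                out.writeLong(s.start);
                out.writeLong(s.length);
            }
        }
        return file;
    }

    // Read the splits back in the same order they were written.
    public static List<Split> readSplits(File file) throws IOException {
        List<Split> splits = new ArrayList<Split>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String path = in.readUTF();
                long start = in.readLong();
                long length = in.readLong();
                splits.add(new Split(path, start, length));
            }
        }
        return splits;
    }
}
```

With something like this, a queued job would only need to keep the temp
file's name in memory until it is actually run.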

The semantics would be subtly different: currently, if you queue a job,
the file listing is done just before the job is executed, not when it's
submitted.  With client-side splits the listing would be fixed at
submission time.  But programs shouldn't rely on that, so I don't think
this is a big worry.

Overall, I don't see any major problems with this.  It won't simplify
things much.  We can remove the code which computes splits in a separate
thread, but we'd have to add code to store splits to temporary files, so
code size is a wash.  But it would remove a potential reliability problem.

