hadoop-common-dev mailing list archives

From "Bryan A. P. Pendleton" ...@geekdom.net>
Subject Re: Creating splits/tasks at the client
Date Thu, 28 Sep 2006 18:24:17 GMT
I'm largely at fault for the "user code running in the JobTracker" that
exists.

I support this change, but I might reformulate it. Why not make this a
sort of special Job? It can even be formulated roughly like this:

input<JobDescription,FilePaths> -> map(Job,FilePath) ->
reduce(Job,FileSplits) -> SchedulableJob

It might even make sense to do an extra run that pre-computes cached
locations of FileSplits, although I think that is still bottlenecked by the
NameNode.
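
A rough Java sketch of that formulation, just to make it concrete. Everything
below (JobDescription, FileSplitInfo, SchedulableJob, the block-aligned split
computation) is a made-up stand-in for illustration, not an existing Hadoop
class or API:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in types for illustration only; none of these exist in Hadoop as-is.
class JobDescription { long splitSize = 64L * 1024 * 1024; }
class FileSplitInfo {
  final String path; final long start; final long length;
  FileSplitInfo(String path, long start, long length) {
    this.path = path; this.start = start; this.length = length;
  }
}
class SchedulableJob {
  final JobDescription job; final List<FileSplitInfo> splits;
  SchedulableJob(JobDescription job, List<FileSplitInfo> splits) {
    this.job = job; this.splits = splits;
  }
}

public class SplitComputationJob {

  // "map(Job, FilePath)": run the split logic for one input file,
  // on a TaskTracker rather than inside the JobTracker JVM.
  static List<FileSplitInfo> map(JobDescription job, String path, long fileLength) {
    List<FileSplitInfo> splits = new ArrayList<FileSplitInfo>();
    for (long off = 0; off < fileLength; off += job.splitSize) {
      splits.add(new FileSplitInfo(path, off, Math.min(job.splitSize, fileLength - off)));
    }
    return splits;
  }

  // "reduce(Job, FileSplits)": collect every file's splits into a single unit
  // the JobTracker can schedule without ever running user code itself.
  static SchedulableJob reduce(JobDescription job, Iterator<List<FileSplitInfo>> perFile) {
    List<FileSplitInfo> all = new ArrayList<FileSplitInfo>();
    while (perFile.hasNext()) {
      all.addAll(perFile.next());
    }
    return new SchedulableJob(job, all);
  }
}

In this picture the JobTracker only ever receives the finished SchedulableJob;
whatever the user's InputFormat does wrong happens inside a task. The extra
pass for cached block locations would slot in the same way, though as noted it
still has to go through the NameNode.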

On 9/28/06, Doug Cutting <cutting@apache.org> wrote:
>
> Benjamin Reed wrote:
> > One of the things that bothers me about the JobTracker is that it is
> > running user code when it creates the FileSplits. In the long term this
> > puts the JobTracker JVM at risk due to errors in the user code.
>
> JVMs are supposed to be able to do this kind of stuff securely.  Still,
> we don't currently leverage this much, and the JVM's security is
> limited, so it is a valid concern.
>
> Note that, while we do avoid running user code in tasktrackers (mapping,
> sorting and reducing are done in a subprocess), they're still run under a
> system user id.  So security issues are to some degree unavoidable.
>
> But in terms of inadvertent denial of service, running user code in the
> job tracker, a single point of failure, does make the system more fragile.
>
> > The JobTracker uses the InputFormat to create a set of tasks that it
> > then schedules. The task creation does not need to happen at the
> > JobTracker. If we allowed the clients to create the set of tasks, the
> > JobTracker would not need to load and run any user generated code. It
> > would also remove some of the processing load from the JobTracker. On
> > the downside it does greatly increase the amount of information sent to
> > the JobTracker when a job is submitted.
>
> Right, so JobSubmissionProtocol.submitJob(String jobFile) could be
> altered to submitJob(String jobFile, Split[]).  The RPC system can
> handle reasonably large values like this, so I don't think that would be
> a problem.  But the memory impact on the JobTracker could become
> significant, since the splits for queued jobs would now be around.  This
> could be mitigated by writing the splits to a temporary file.
>
> The semantics would be subtly different: if you queue a job now, the
> file listing is done just before the job is executed, not when it's
> submitted.  But programs shouldn't rely on that, so I don't think this
> is a big worry.
>
> Overall, I don't see any major problems with this.  It won't simplify
> things much.  We can remove the code which computes splits in a separate
> thread, but we'd have to add code to store splits to temporary files, so
> code size is a wash.  And it would remove a potential reliability problem.
>
> Doug
>
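
For concreteness, here's roughly what the client-side flow described above
could look like. The types below are illustrative stand-ins, not the real
InputFormat or JobSubmissionProtocol signatures, which differ in detail and
have changed between versions:

// Illustrative stand-ins only; real Hadoop interfaces differ.
class Split {
  final String file; final long start; final long length;
  Split(String file, long start, long length) {
    this.file = file; this.start = start; this.length = length;
  }
}

interface ClientSideInputFormat {
  // User code: runs in the client JVM, so a bug here can't hurt the JobTracker.
  Split[] getSplits(String jobFile, int numSplits);
}

interface JobSubmission {
  // The altered RPC: the precomputed splits travel with the job file.
  void submitJob(String jobFile, Split[] splits) throws java.io.IOException;
}

public class JobClientSketch {
  static void submit(String jobFile, ClientSideInputFormat format,
                     JobSubmission jobTracker, int numSplits) throws java.io.IOException {
    // 1. Compute the splits on the client, using the job's own InputFormat.
    Split[] splits = format.getSplits(jobFile, numSplits);

    // 2. Ship the job file plus splits in a single call.  On the JobTracker
    //    side, splits for queued jobs could be spilled to a temporary file
    //    rather than held in memory until the job is scheduled.
    jobTracker.submitJob(jobFile, splits);
  }
}

The memory concern Doug raises then comes down to how long the JobTracker
keeps those Split[] arrays around before spilling them to a temporary file.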



-- 
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
