hadoop-common-dev mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: Creating splits/tasks at the client
Date Fri, 29 Sep 2006 16:15:41 GMT
No, even with user-defined Splits we don't need to run user code in the
JobTracker, as long as we make Split a Writable class that holds the hosts array.

Split will write the hosts first, so in the JobTracker, when you get the
byte array representing the Split, any fields from the subclass will
follow the serialized Split bytes. The JobTracker can skip the type name
at the front of the bytes and then deserialize just a
Split (ignoring the rest). You can make this process robust by putting a
fingerprint at the beginning and end of the serialized part of Split, so
that you can detect user-defined Splits that change the serialization
order. (This is another example of why Writable is cooler than
Serializable. It would be really hard to deserialize just a superclass
from a serialized subclass using Java serialization.)

You would ship the full byte array to the TaskTrackers so that the
InputFormats running in the child task JVMs can deserialize the full type.
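
The scheme described above could be sketched roughly like this. This is a
self-contained illustration, not Hadoop code: the class names, the fingerprint
value, and the helper methods are all made up for the example, and the type-name
prefix mentioned above is omitted so the sketch can focus on the hosts-first
layout and the base-only deserialization.

```java
import java.io.*;

// Illustrative sketch (not Hadoop's actual classes): the base Split
// writes a fingerprint, its host list, and a closing fingerprint.
// Subclass fields follow, so a reader that only knows about Split
// can stop after the closing fingerprint and ignore the rest.
class Split {
    static final int FINGERPRINT = 0x5EEDF00D; // arbitrary marker value

    String[] hosts = new String[0];

    Split() {}
    Split(String[] hosts) { this.hosts = hosts; }

    void write(DataOutput out) throws IOException {
        out.writeInt(FINGERPRINT);              // leading fingerprint
        out.writeInt(hosts.length);
        for (String h : hosts) out.writeUTF(h);
        out.writeInt(FINGERPRINT);              // trailing fingerprint
        writeRest(out);                         // subclass fields follow
    }

    // Subclasses append their own fields here.
    void writeRest(DataOutput out) throws IOException {}

    // Deserializes only the base Split portion; anything after the
    // trailing fingerprint (the subclass fields) is left unread.
    // A mismatched fingerprint detects a user-defined Split that
    // changed the serialization order.
    void readBase(DataInput in) throws IOException {
        if (in.readInt() != FINGERPRINT)
            throw new IOException("bad leading fingerprint");
        int n = in.readInt();
        hosts = new String[n];
        for (int i = 0; i < n; i++) hosts[i] = in.readUTF();
        if (in.readInt() != FINGERPRINT)
            throw new IOException("bad trailing fingerprint");
    }
}

// A hypothetical user-defined Split with an extra field; the
// JobTracker never needs this class just to recover the hosts.
class UserSplit extends Split {
    String path;
    UserSplit(String[] hosts, String path) {
        super(hosts);
        this.path = path;
    }
    @Override void writeRest(DataOutput out) throws IOException {
        out.writeUTF(path);
    }
}

public class SplitSketch {
    public static void main(String[] args) throws IOException {
        // Client side: serialize the full user-defined Split.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new UserSplit(new String[]{"host1", "host2"}, "/data/part-0")
            .write(new DataOutputStream(bos));
        byte[] bytes = bos.toByteArray();   // what the JobTracker receives

        // JobTracker side: base type only, user class never loaded.
        Split base = new Split();
        base.readBase(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(String.join(",", base.hosts));
    }
}
```

The full byte array would still be shipped to the TaskTrackers, where the user's
InputFormat (which does have the subclass on its classpath) deserializes the
whole thing, extra fields included.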


Owen O'Malley wrote:
> On Sep 29, 2006, at 12:20 AM, Benjamin Reed wrote:
>> Please correct me if I'm reading the code incorrectly, but it seems
>> like submitJob puts the submitted job on the jobInitQueue, which is
>> immediately dequeued by the JobInitThread, and then initTasks() will get
>> the file splits and create Tasks. Thus, it doesn't seem like there is
>> any difference in memory footprint.
> Agreed, it won't cost more memory. In fact, it will be less because we
> won't have the init task thread running and creating InputFormats and
> running user code. Of course, once we allow user-defined InputSplits
> we will be back in exactly the same boat of running user-code on the
> JobTracker, unless we also ship over the preferred hosts for each
> InputFormat too.
> -- Owen
