hadoop-mapreduce-user mailing list archives

From blah blah <tmp5...@gmail.com>
Subject Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.
Date Fri, 01 Feb 2013 14:24:25 GMT

(I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)

I have a question regarding my assumptions about the Yarn-MR design, especially
the InputSplit processing. Can someone confirm them or point out the mistakes
in my MR-Yarn design assumptions?

These are my assumptions regarding the design:
1. JobClient submits the Job; the AppMaster is created, etc.
2. Get the number of splits and, especially, their hosts, so that a Task can
be started on the same node // MR-AM, via InputFormat.getSplits() { ...;
FileSystem.getFileBlockLocations(); ...; }
3. Start N tasks // MR-AM
4. Each Task processes its (single) split (unless splitsNr >> tasksNr) using
the InputFormat/RecordReader // MR-Task; from here on, the InputFormat
operates only on a single split
5. Start the RecordReader and process the split // MR-Task
6. map() // MR-Task
7. Do the rest of MR // MR-Task
8. Dump the output to HDFS or other storage // MR-Task
9. Report FINISH, free resources // MR-AM
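To make my assumption about step 2 concrete, here is a minimal, self-contained sketch of how I understand the split computation to work: one split per block, each split carrying the block's host hints for the AM. The class and field names here are simplified stand-ins I made up for illustration, not the actual org.apache.hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Hadoop's InputSplit/BlockLocation (hypothetical
// names, for illustration only -- not the real Hadoop API).
public class SplitSketch {
    static class Split {
        final long offset, length;
        final String[] hosts;          // where the underlying block lives
        Split(long offset, long length, String[] hosts) {
            this.offset = offset; this.length = length; this.hosts = hosts;
        }
    }

    // My understanding of step 2: carve the file into block-sized splits,
    // attaching each block's hosts so the AM can request containers there.
    static List<Split> getSplits(long fileLength, long blockSize, String[][] blockHosts) {
        List<Split> splits = new ArrayList<>();
        long offset = 0;
        int block = 0;
        while (offset < fileLength) {
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new Split(offset, length, blockHosts[block++]));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 250 MB file with 128 MB blocks should yield 2 splits.
        String[][] hosts = { {"node1", "node2"}, {"node2", "node3"} };
        List<Split> splits = getSplits(250L << 20, 128L << 20, hosts);
        System.out.println(splits.size());           // 2
        System.out.println(splits.get(1).hosts[0]);  // node2
    }
}
```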

Two quick bonus questions:

1. I have added an additional log entry in FileInputFormat.getSplits(), but I
cannot see it in the log files. I am using the WordCount example and the INFO
level. What might be the problem?
2. In FileSystem.getFileBlockLocations() the hostname is hard-coded as
"localhost". Where is this mapped to the actual host name, so that the AM
knows which nodes to request?
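To make the getFileBlockLocations() question concrete, this is the kind of locality bookkeeping I assume happens somewhere between the block locations and the AM's container requests. It is a hypothetical sketch with made-up names, not the real Hadoop RM protocol:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (made-up names): count how many splits want each host,
// so the AM could prefer those nodes when requesting containers.
public class LocalitySketch {
    static Map<String, Integer> hostDemand(String[][] splitHosts) {
        Map<String, Integer> demand = new HashMap<>();
        for (String[] hosts : splitHosts)
            for (String h : hosts)
                demand.merge(h, 1, Integer::sum);
        return demand;
    }

    public static void main(String[] args) {
        // If getFileBlockLocations() only ever returned "localhost", every
        // split would demand the same host and locality would be meaningless:
        Map<String, Integer> d = hostDemand(new String[][]{{"localhost"}, {"localhost"}});
        System.out.println(d);  // {localhost=2}
    }
}
```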

Thanks for any reply.
