hadoop-mapreduce-dev mailing list archives

From "Alexis (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-2229) Initialize reader in Sort example
Date Thu, 23 Dec 2010 03:21:02 GMT
Initialize reader in Sort example
---------------------------------

                 Key: MAPREDUCE-2229
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2229
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: examples
    Affects Versions: 0.21.0
            Reporter: Alexis


As described in the "Total Sort" section of the HTDG book (page 223), I tried to create a Hadoop
job that sorts its input globally, using InputSampler with TotalOrderPartitioner.

To reproduce the exception, run the MapReduce Sort example with the following arguments:
{noformat}
org.apache.hadoop.examples.Sort
	-r 2
	-outKey org.apache.hadoop.io.Text
	-outValue org.apache.hadoop.io.Text
	-inFormat org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
	-outFormat org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
	-totalOrder 0.1 10000 10
	test/sortInput
	test/sortOutput
{noformat}

The issue has already been described here:
- http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201011.mbox/%3CDB1B07B75C01FB40B814678DEE6E0085175C86CDFF@bdc.taomee-ex.com%3E
- http://www.mail-archive.com/mapreduce-user@hadoop.apache.org/msg01372.html

This is a somewhat related comment:
http://www.mail-archive.com/common-user@hadoop.apache.org/msg03947.html

We need to initialize the reader to avoid the NPE that occurs when generating the partition file:
{noformat}
Exception in thread "main" java.lang.NullPointerException
	at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
	at org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader.nextKeyValue(KeyValueLineRecordReader.java:91)
	at org.apache.hadoop.mapreduce.lib.partition.InputSampler$RandomSampler.getSample(InputSampler.java:220)
	at org.apache.hadoop.mapreduce.lib.partition.InputSampler.writePartitionFile(InputSampler.java:315)
	at org.apache.hadoop.examples.Sort.run(Sort.java:166)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
{noformat} 

Right now, this initialization happens only in runNewMapper in org.apache.hadoop.mapred.MapTask,
but the sampling is performed before the job starts, so the reader used for sampling is never
initialized. The TeraInputFormat class used by TeraSort has its own writePartitionFile method.
This is the javadoc comment of the createRecordReader method in the InputFormat class:
{noformat}
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
{noformat}
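
To illustrate the contract quoted above, here is a minimal, self-contained sketch of the failure and of the proposed fix. It does not use the Hadoop API: LineReader and sample are hypothetical stand-ins for a record reader and for the sampling loop, written only to show that a reader created outside the framework must be initialized by the caller before nextKeyValue is called. In Hadoop the corresponding call would be RecordReader#initialize(InputSplit, TaskAttemptContext), as the javadoc says.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SamplerSketch {

    /** Hypothetical stand-in for a record reader; NOT the Hadoop RecordReader API. */
    static class LineReader {
        private final List<String> lines;
        private Iterator<String> cursor; // stays null until initialize() is called
        private String current;

        LineReader(List<String> lines) {
            this.lines = lines;
        }

        /** Plays the role of the initialize step that the framework normally performs. */
        void initialize() {
            cursor = lines.iterator();
        }

        boolean nextKeyValue() {
            // Dereferences cursor, so this throws NullPointerException
            // when initialize() was never called.
            if (!cursor.hasNext()) {
                return false;
            }
            current = cursor.next();
            return true;
        }

        String getCurrentKey() {
            return current;
        }
    }

    /**
     * Hypothetical sampling loop: reads up to numSamples keys from the input.
     * The initializeFirst flag switches between the fixed and the broken behaviour.
     */
    static List<String> sample(List<String> input, int numSamples, boolean initializeFirst) {
        LineReader reader = new LineReader(input);
        if (initializeFirst) {
            reader.initialize(); // the step that is missing when sampling before the job starts
        }
        List<String> samples = new ArrayList<>();
        while (samples.size() < numSamples && reader.nextKeyValue()) {
            samples.add(reader.getCurrentKey());
        }
        return samples;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "a", "c");
        System.out.println("sampled with initialize(): " + sample(input, 2, true));
        try {
            sample(input, 2, false);
        } catch (NullPointerException e) {
            System.out.println("NPE when initialize() is skipped");
        }
    }
}
```

The point of the sketch is only that whoever creates the reader outside a running task is responsible for the initialize call; how the required context object should be built for the sampler is left open here.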


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

