hive-dev mailing list archives

From "Navis Ryu" <navis....@nexr.com>
Subject Review Request 26917: Make OrcNewInputFormat return row number as a key
Date Mon, 20 Oct 2014 09:17:32 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26917/
-----------------------------------------------------------

Review request for hive.


Bugs: trunk
    https://issues.apache.org/jira/browse/trunk


Repository: hive-git


Description
-------

Key is null in map when OrcNewInputFormat is used as Input Format Class

When I use OrcNewInputFormat as the input format class for my MapReduce job, the key passed
to my map method is always null, so I have no way to get the row number there. By comparison,
RCFileInputFormat (for RC files) passes the row number as the key of the map method, so I
know which row I am processing.

Is there any workaround for getting the row number in my map method? Of course, I can
count the rows myself, but that has two problems: #1 I have to assume the rows arrive in
order; #2 I will get duplicated (and wrong) row numbers if a big input file is divided into
multiple file splits (each split runs my map method in a separate task, possibly on a
different data node). At this point, I am looking for a better way to get the row number of
each row processed in the map method.
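
Problem #2 can be illustrated with a small, self-contained sketch. It is plain Java with no
Hadoop or Hive classes; the class and method names are hypothetical and exist only for this
illustration. It assumes one file of 5 rows divided into two splits (rows 0-2 and rows 3-4).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: simulates row numbering, no Hadoop classes involved.
class SplitCounterSketch {

    // A map task that numbers rows with its own counter always starts at 0.
    static List<Long> localCounters(int rowsInSplit) {
        List<Long> numbers = new ArrayList<>();
        for (long i = 0; i < rowsInSplit; i++) {
            numbers.add(i);
        }
        return numbers;
    }

    // Numbering by position in the file: offset of the split plus the local index.
    static List<Long> fileRowNumbers(long firstRowOfSplit, int rowsInSplit) {
        List<Long> numbers = new ArrayList<>();
        for (long i = 0; i < rowsInSplit; i++) {
            numbers.add(firstRowOfSplit + i);
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Per-task counters: 0 and 1 appear in both splits.
        System.out.println(localCounters(3) + " " + localCounters(2));         // [0, 1, 2] [0, 1]
        // File-based row numbers: unique across splits.
        System.out.println(fileRowNumbers(0, 3) + " " + fileRowNumbers(3, 2)); // [0, 1, 2] [3, 4]
    }
}
```

A per-task counter cannot know the offset of its split, which is why the row number has to
come from the reader itself.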

Here is what I have in my map logs:

	[2014-08-06 09:39:25 DEBUG com.xxxx.hadoop.orcfile.OrcFileMap]: Mapper Input Key: (null)
	[2014-08-06 09:39:25 DEBUG com.xxxx.hadoop.orcfile.OrcFileMap]: Mapper Input Value: {Q81510000,
T99760000, 699760000, 81567560000, 9667981610000, 978989898980000, Laura, Lauraxxx@gmail.com}

My map method is:

	protected void map(Object key, Writable value, Context context)
			throws IOException, InterruptedException {
		logger.debug("Mapper Input Key: " + key);
		logger.debug("Mapper Input Value: " + value.toString());
		.....
	}

The fix should be to call the following in the nextKeyValue() method and pass the result all
the way up to the map() method as its key:

          reader.getRowNumber(); 
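
A minimal sketch of the idea, not the actual patch: the types below (RowSource,
ListRowSource, RowNumberRecordReader) are hypothetical stand-ins written in plain Java so
the example runs without Hadoop or Hive on the classpath. It assumes getRowNumber() reports
the number of the row that the next call to next() will return, so the key is captured
before the reader advances.

```java
import java.util.Arrays;
import java.util.List;

class RowNumberKeySketch {

    // Hypothetical stand-in for the underlying ORC row reader.
    interface RowSource {
        boolean hasNext();
        long getRowNumber(); // assumed: number of the row that next() will return
        String next();
    }

    // Hypothetical in-memory source whose first row has an arbitrary file offset,
    // as would be the case for a split that does not start at the top of the file.
    static class ListRowSource implements RowSource {
        private final List<String> rows;
        private final long firstRow;
        private int pos = 0;

        ListRowSource(List<String> rows, long firstRow) {
            this.rows = rows;
            this.firstRow = firstRow;
        }

        public boolean hasNext() { return pos < rows.size(); }
        public long getRowNumber() { return firstRow + pos; }
        public String next() { return rows.get(pos++); }
    }

    // Record reader in the style of the new MapReduce API: the key is the row number.
    static class RowNumberRecordReader {
        private final RowSource reader;
        private Long key;     // stays null until nextKeyValue() succeeds
        private String value;

        RowNumberRecordReader(RowSource reader) { this.reader = reader; }

        boolean nextKeyValue() {
            if (!reader.hasNext()) {
                return false;
            }
            key = reader.getRowNumber(); // capture before advancing the reader
            value = reader.next();
            return true;
        }

        Long getCurrentKey() { return key; }
        String getCurrentValue() { return value; }
    }

    public static void main(String[] args) {
        // Pretend this split starts at row 3 of the file.
        RowNumberRecordReader rr =
                new RowNumberRecordReader(new ListRowSource(Arrays.asList("a", "b"), 3L));
        while (rr.nextKeyValue()) {
            System.out.println(rr.getCurrentKey() + " -> " + rr.getCurrentValue());
        }
    }
}
```

With the key populated this way, a mapper would see the row number instead of null
regardless of which split it is processing.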


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcNewInputFormat.java b6ad0dc 

Diff: https://reviews.apache.org/r/26917/diff/


Testing
-------


Thanks,

Navis Ryu

