hadoop-user mailing list archives

From Jason Wang <jason.j.w...@gmail.com>
Subject Hadoop streaming inserts tabs into mapper output
Date Thu, 18 Oct 2012 20:12:24 GMT
With Hadoop streaming and no reducer, I would expect the output written to
HDFS to be the exact STDOUT from the mapper.  Instead, I noticed that a tab
character (0x9) is inserted before every newline character (0xa).  This is
a problem for me because the output of my mapper is binary data, which I
would like to be written to HDFS unaltered.

I've narrowed my issue down to a very simple example that anybody can run.
Create a simple test.txt file with 4 or more lines of text (it must contain
newline characters to show the problem).  Copy it to HDFS, and run a
simple streaming job with "cat" as the mapper:

hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar \
  -input /Users/hadoop/test/test.txt \
  -output /Users/hadoop/test/output \
  -mapper "cat" \
  -reducer NONE

Copy the output/part-00000 file to local disk and hexdump it.  You'll
notice that every 0xA byte has become 0x9 0xA.
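
For anyone who would rather not read a hexdump by eye, the comparison can be sketched in Python. This is only a rough check, assuming both the original test.txt and the copied part-00000 are available locally; the function names and the command-line arguments are made up for illustration and are not part of Hadoop.

```python
import sys
from pathlib import Path


def count_tab_newline(data: bytes) -> int:
    """Count newline bytes (0x0A) immediately preceded by a tab (0x09)."""
    return data.count(b"\t\n")


def compare(original: bytes, output: bytes) -> dict:
    """Summarize how the job output differs from the original input bytes."""
    return {
        "identical": original == output,
        "tab_newline_in_original": count_tab_newline(original),
        "tab_newline_in_output": count_tab_newline(output),
        "extra_bytes": len(output) - len(original),
    }


if __name__ == "__main__":
    # Hypothetical usage: python check_tabs.py test.txt part-00000
    if len(sys.argv) != 3:
        sys.exit("usage: check_tabs.py ORIGINAL OUTPUT")
    original = Path(sys.argv[1]).read_bytes()
    output = Path(sys.argv[2]).read_bytes()
    for name, value in compare(original, output).items():
        print(f"{name}: {value}")
```

If the behaviour described above is present, the output file should be larger than the input by one byte per line, and every one of those extra bytes should show up in the tab-before-newline count.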

There must be a parameter to streaming that can fix this, but I have not
been able to find it.

Thanks in advance,
