Streaming Documentation Update
------------------------------
Key: HADOOP-4182
URL: https://issues.apache.org/jira/browse/HADOOP-4182
Project: Hadoop Core
Issue Type: Improvement
Components: contrib/streaming
Affects Versions: 0.19.0
Reporter: Abdul Qadeer
Priority: Minor
Fix For: 0.19.0
When Text input data is used with streaming, every line is expected to end with a newline.
The results of a Hadoop job are undefined if an input file does not end in a newline, because
they then depend on how files are assigned to mappers.
Example:
In streaming, suppose
mapper = xargs cat
reducer = cat
and the input is a two-line file, where each line is a symbolic link in HDFS:
link1\n
link2\n
EOF
link1 points to a file which contains
This is line1EOF
link2 points to a file which contains
This is line2EOF
Running a streaming job with only one split will produce the result:
This is line1This is line2\t\n
But with two splits, the result will be:
This is line1\t\n
This is line2\t\n
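The difference between the two cases can be reproduced with a small simulation. This is only a sketch, not Hadoop code: the FILES dictionary and the xargs_cat and run_job helpers are hypothetical stand-ins for the HDFS files, the `xargs cat` mapper, and the streaming framework. It assumes that streaming treats each line of mapper output as one record and that a line without a tab becomes a key with an empty value.

```python
# Hypothetical in-memory stand-ins for the two files; neither ends in a newline.
FILES = {"link1": "This is line1", "link2": "This is line2"}


def xargs_cat(names):
    # Mimic `xargs cat`: concatenate the named files byte for byte.
    return "".join(FILES[name] for name in names)


def run_job(splits):
    # Run one simulated mapper per split, then an identity (`cat`) reducer.
    records = []
    for split in splits:
        mapper_output = xargs_cat(split)
        # Each line of mapper output is one record. With no tab in the
        # line, the whole line is the key and the value is empty.
        for line in mapper_output.splitlines():
            key, _, value = line.partition("\t")
            records.append((key, value))
    records.sort()  # the shuffle sorts records by key before the reducer
    return "".join(f"{key}\t{value}\n" for key, value in records)


one_split = run_job([["link1", "link2"]])
two_splits = run_job([["link1"], ["link2"]])
print(repr(one_split))
print(repr(two_splits))
```

With one split, both files pass through the same `xargs cat` process, so their contents are joined into a single record. With two splits, each file is read by its own mapper and the end of the mapper's output terminates the record.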
In summary, the output depends on how many mappers were invoked. As a caution, the Streaming
wiki should note that users must end every line, including the last line of each input file,
with a newline to avoid such problems.
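One way to follow this advice is to check local files before uploading them to HDFS. The helper below is a minimal sketch under that assumption; ensure_trailing_newline is a hypothetical name, not part of Hadoop, and it works on local files only.

```python
import os


def ensure_trailing_newline(path):
    # Append a newline if the file is non-empty and does not already end
    # with one. Returns True if the file was modified.
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # leave empty files untouched
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False
        f.seek(0, os.SEEK_END)
        f.write(b"\n")
        return True
```

Calling it a second time on the same file changes nothing, so it is safe to run over a whole input directory before each upload.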