hadoop-common-dev mailing list archives

From "Abdul Qadeer (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-4182) Streaming Documentation Update
Date Tue, 16 Sep 2008 05:37:44 GMT
Streaming Documentation Update
------------------------------

                 Key: HADOOP-4182
                 URL: https://issues.apache.org/jira/browse/HADOOP-4182
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/streaming
    Affects Versions: 0.19.0
            Reporter: Abdul Qadeer
            Priority: Minor
             Fix For: 0.19.0


When Text input data is used with streaming, every line is expected to end with a newline.
Results are undefined if an input file does not end with a newline: the output then depends
on how files are assigned to mappers.
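
One way to catch the problem before submitting a job is to check the last byte of each input file. The following is a minimal sketch, assuming the files are available on a local filesystem (not read through HDFS); the helper name ends_with_newline is hypothetical, and treating an empty file as acceptable is an assumption made here.

```python
import os


def ends_with_newline(path):
    """Return True if the file at `path` ends with a newline byte.

    Hypothetical helper for local files only. An empty file is treated as
    acceptable (an assumption) because it contributes no partial line.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return True
        # Move to the last byte and compare it with b"\n".
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"
```

Any file for which this returns False would need a trailing newline appended before being used as streaming input.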

Example:

Suppose a streaming job uses

mapper = xargs cat
reducer = cat

and the input is a two-line file in which each line is a symbolic link in HDFS:

link1\n
link2\n
EOF

link1 points to a file which contains

This is line1EOF

link2 points to a file which contains

This is line2EOF

Running a streaming job with only one split produces:

This is line1This is line2\t\n

With two splits, the result is:

This is line1\t\n
This is line2\t\n

In summary, the output depends on how many mappers were invoked.  As a caution, the
Streaming wiki should state that users must always end every line, including the last
line of each file, with a newline to avoid such problems.
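
The behaviour described in this report can be reproduced without a cluster. The following is a minimal sketch that simulates it in plain Python; it does not call Hadoop, and the names FILES, split_inputs, run_mapper, to_records and simulate are hypothetical. It assumes that `xargs cat` concatenates the bytes of the files in a split, and that a mapper output line without a tab becomes a key with an empty value, which the `cat` reducer then sees as the line followed by a tab and a newline.

```python
# Contents of the two files from the example; neither ends with a newline.
FILES = ["This is line1", "This is line2"]


def split_inputs(files, num_splits):
    """Divide the file list into contiguous chunks, one chunk per mapper."""
    size = -(-len(files) // num_splits)  # ceiling division
    return [files[i:i + size] for i in range(0, len(files), size)]


def run_mapper(file_contents):
    """Model `xargs cat`: concatenate the contents of every file in the split."""
    return "".join(file_contents)


def to_records(mapper_stdout):
    """Model streaming's line handling: each output line becomes a key with an
    empty value. A final fragment without a newline still counts as a line."""
    return [line + "\t\n" for line in mapper_stdout.splitlines()]


def simulate(files, num_splits):
    """Return the sorted records that the `cat` reducer would emit."""
    records = []
    for split in split_inputs(files, num_splits):
        records.extend(to_records(run_mapper(split)))
    return sorted(records)


if __name__ == "__main__":
    print(simulate(FILES, 1))
    print(simulate(FILES, 2))
    # With newline-terminated files the split count no longer matters.
    terminated = [content + "\n" for content in FILES]
    print(simulate(terminated, 1) == simulate(terminated, 2))
```

With the unterminated files, one split yields a single merged record and two splits yield two separate records, matching the outputs shown above. Once each file ends with a newline, both split counts give the same two records.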

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

