hadoop-common-dev mailing list archives

From "Robert Chansler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2954) In streaming, map-output cannot have empty keys
Date Tue, 25 Mar 2008 03:03:26 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-2954:
------------------------------------

    Fix Version/s:     (was: 0.17.0)

> In streaming, map-output cannot have empty keys
> -----------------------------------------------
>
>                 Key: HADOOP-2954
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2954
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.16.0
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Sameer Paranjpye
>
> Here is the analysis when the mapper and reducer are both /bin/cat, with the
> default key field separator '\t' (tab).
> For example, if the input line is:
> \tSDSDFIKSDFSDFJS
> the input for the mapper ('cat' in this case) is:
> \tSDSDFIKSDFSDFJS
> -
> the output of the mapper is split into a key, value pair as below:
> (key, value) -> (\tSDSDFIKSDFSDFJS, "")
> (i.e. the value is empty)
> the function that splits the output into a (key, value) pair for
> streaming jobs ignores the first character of the line
> -
> from the above (key, value) pair, the input for the reducer is:
> (key followed by separator followed by value)
> \tSDSDFIKSDFSDFJS\t
> if the reducer is set to NONE, the above line is the output of
> the map task
> -
> the output of the reducer ('cat' in this case) is:
> \tSDSDFIKSDFSDFJS\t
> -
> if the line starts with the field separator, the output of the mapper
> can be assigned to different reducers, because the line may contain
> more than one instance of the field separator - for example:
> input-line=\tABCDEFGH
> key=\tABCDEFGH
> value=
> (value is empty)
> output-line=\tABCDEFGH\t
> input-line=\tABCDEFGHYH\tJHUHJH
> key=\tABCDEFGHYH
> value=JHUHJH
> output-line=\tABCDEFGHYH\tJHUHJH
> assuming defaults (HashPartitioner), they are likely to be assigned to
> different reducers because the keys are different.
> The streaming contract says that everything from the beginning of the line up to
> the first tab is the key, so the key should be the empty string. But it is not.
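
The behaviour described above can be reproduced with a small sketch. This is not the Hadoop streaming source: both function names are hypothetical, and the "skip the first character" search is only a model of what the report describes. The second function shows the split that the streaming contract implies, where a leading separator yields an empty key.

```python
def split_ignoring_first_char(line, sep="\t"):
    """Model of the reported behaviour (assumption): the separator search
    starts at index 1, so a leading separator is never seen."""
    pos = line.find(sep, 1)
    if pos == -1:
        # No separator found after the first character: whole line is the key.
        return line, ""
    return line[:pos], line[pos + len(sep):]


def split_per_contract(line, sep="\t"):
    """Split at the first separator anywhere in the line, so a line that
    starts with the separator gets an empty key."""
    pos = line.find(sep)
    if pos == -1:
        return line, ""
    return line[:pos], line[pos + len(sep):]


if __name__ == "__main__":
    for line in ("\tABCDEFGH", "\tABCDEFGHYH\tJHUHJH"):
        print(repr(line))
        print("  reported:", split_ignoring_first_char(line))
        print("  contract:", split_per_contract(line))
```

With the modelled behaviour the two sample lines get different keys, which is why a hash-based partitioner may send them to different reducers. With the contract-compliant split both lines share the empty key.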

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

