hadoop-common-dev mailing list archives

From "Johan Oskarsson (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HADOOP-4913) When using the Hadoop streaming jar if the reduce job outputs only a value (no key) the code incorrectly outputs the value along with the tab character (key/value) separator.
Date Tue, 19 May 2009 16:32:45 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson resolved HADOOP-4913.
-------------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: site)

You can do this in user code by implementing an output format that ignores the key and only
saves the value. Have a look at TextOutputFormat for guidance.
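
For illustration, here is a minimal sketch of such a value-only output format against the old org.apache.hadoop.mapred API that 0.18 uses. The class name ValueOnlyTextOutputFormat and the details of its record writer are assumptions for this example, not something shipped with Hadoop:

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class ValueOnlyTextOutputFormat<K, V> extends FileOutputFormat<K, V> {

  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                            String name, Progressable progress)
      throws IOException {
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    FileSystem fs = file.getFileSystem(job);
    final DataOutputStream out = fs.create(file, progress);
    return new RecordWriter<K, V>() {
      public void write(K key, V value) throws IOException {
        // Ignore the key entirely; write only the value followed by a newline.
        if (value != null) {
          out.write(value.toString().getBytes("UTF-8"));
        }
        out.write('\n');
      }
      public void close(Reporter reporter) throws IOException {
        out.close();
      }
    };
  }
}

A class along these lines could then be supplied to the streaming jar with its -outputformat option, or set as the job's output format class in a driver.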

> When using the Hadoop streaming jar if the reduce job outputs only a value (no key) the code incorrectly outputs the value along with the tab character (key/value) separator.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4913
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4913
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.18.2
>         Environment: Red Hat Linux 5.
>            Reporter: John Fisher
>            Priority: Minor
>
> I would like the output of my streaming job to be only the value, omitting the key and the key/value separator. However, when printing only the value I am noticing that each line ends with a tab character. I believe I have tracked down the issue (described below), but I'm not 100% sure. The fix is working for me, though, so I figured it should maybe be incorporated into the code base.
> The tab gets printed because of a bad check in the TextOutputFormat code. It checks whether the "key" and "value" objects are null. If both are not null, the line should be printed as <key><separator><value>; otherwise only the key or the value should be printed, depending on which is defined. The bug is that the key and value are always defined. I traced further up to see whether the error was that these objects were defined when they shouldn't be, but it looks like that is how it is meant to work. I changed the Hadoop code to check for a null object and also for an empty string.
> *** Patch code begin ***
> if (!nullKey) {
>   nullKey = (key.toString().length() == 0);
> }
> if (!nullValue) {
>   nullValue = (value.toString().length() == 0);
> }
> *** Patch code end ***
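
To make the placement of those two checks concrete, here is a conceptual, self-contained sketch of a LineRecordWriter-style write() with the empty-string test folded in next to the null test. It is modeled on the general shape of TextOutputFormat's writer, not copied from the 0.18.2 source, and the class name ValueAwareLineWriter is made up for this example:

import java.io.DataOutputStream;
import java.io.IOException;

public class ValueAwareLineWriter {
  private static final byte[] NEWLINE = "\n".getBytes();
  private final DataOutputStream out;
  private final byte[] separator;

  public ValueAwareLineWriter(DataOutputStream out, String keyValueSeparator) {
    this.out = out;
    this.separator = keyValueSeparator.getBytes();
  }

  public synchronized void write(Object key, Object value) throws IOException {
    boolean nullKey = (key == null);
    boolean nullValue = (value == null);

    // The proposed patch: also treat an empty toString() as "nothing to print",
    // which is what streaming hands us when the reducer emits no key or no value.
    if (!nullKey) {
      nullKey = (key.toString().length() == 0);
    }
    if (!nullValue) {
      nullValue = (value.toString().length() == 0);
    }

    if (nullKey && nullValue) {
      return;                          // nothing to write at all
    }
    if (!nullKey) {
      out.write(key.toString().getBytes("UTF-8"));
    }
    if (!nullKey && !nullValue) {
      out.write(separator);            // tab only when both sides are present
    }
    if (!nullValue) {
      out.write(value.toString().getBytes("UTF-8"));
    }
    out.write(NEWLINE);
  }
}
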
> The OutputCollector calls the TextOutputFormat.write() method with whatever objects are passed into it (see ReduceTask.java, line 300), so that part is fine.
> But above that, if you look at the run() method in PipeMapRed.java, you will see that the code creates new key and value objects and then starts reading lines and feeding them to the OutputCollector. This is why the key and value are always defined by the time they hit TextOutputFormat.write(), and why we always see the tab.
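
For context, here is a conceptual sketch of the split streaming applies to each line the reducer prints. The helper splitOnFirstTab is hypothetical, written for this example rather than taken from PipeMapRed, but it shows why the value that reaches write() is an empty Text rather than null when the reducer emits no tab:

import org.apache.hadoop.io.Text;

public class StreamingSplitSketch {

  // Split a reducer output line at the first tab: everything before it is the
  // key, everything after it is the value. With no tab, the whole line becomes
  // the key and the value is an *empty* Text, not null.
  static void splitOnFirstTab(String line, Text key, Text value) {
    int tab = line.indexOf('\t');
    if (tab >= 0) {
      key.set(line.substring(0, tab));
      value.set(line.substring(tab + 1));
    } else {
      key.set(line);
      value.set("");   // defined but empty, so the null check in write() still passes
    }
  }

  public static void main(String[] args) {
    Text key = new Text();
    Text value = new Text();
    splitOnFirstTab("just-a-value-no-tab", key, value);
    System.out.println("key='" + key + "' value='" + value + "'");
    // Prints: key='just-a-value-no-tab' value=''
  }
}
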
> Thanks,
> John

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

