hadoop-common-dev mailing list archives

From "Amareshwari Sriramadasu (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2806) Streaming has no way to force entire record (or null) as key
Date Fri, 14 Mar 2008 05:13:24 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-2806:
--------------------------------------------

    Attachment: patch-2806.txt

If there is no tab in the line, the entire line is read as the key and the value is null.
This behavior should be fine as long as it is documented. Here is a patch that adds the documentation.
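
For illustration only, the rule described above can be sketched in Python roughly as follows. This is not the streaming code itself: the function name `split_key_value` is hypothetical, and the sketch assumes the default TAB separator and uses `None` to stand in for the null value.

```python
def split_key_value(line, separator="\t"):
    """Split one line of text into a (key, value) pair.

    Hypothetical helper that mirrors the documented rule:
    - text up to the first separator is the key, the rest is the value;
    - with no separator, the whole line is the key and the value is None.
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    if not found:
        # No tab in the line: entire line is the key, value is null.
        return line, None
    return key, value


if __name__ == "__main__":
    # One record with a tab, one with a plain space instead.
    for record in ["a\t1", "a 1"]:
        print(split_key_value(record))
```

Under this sketch, "a<TAB>1" is keyed on "a" alone, while "a 1" is keyed on the whole line, which would explain why the two inputs in the report below sort differently.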

> Streaming has no way to force entire record (or null) as key
> ------------------------------------------------------------
>
>                 Key: HADOOP-2806
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2806
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/streaming
>            Reporter: Marco Nicosia
>            Assignee: Amareshwari Sriramadasu
>            Priority: Minor
>             Fix For: 0.17.0
>
>         Attachments: patch-2806.txt
>
>
> I think perhaps streaming needs a "-allkey" or "-nullkey" option? Otherwise, I'm concerned there is a subtle streaming documentation problem.
> These two docs:
> http://hadoop.apache.org/core/docs/current/streaming.html
> http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged with above?)
> ... seem to ignore that streaming, by default, splits key/value on TAB. Sure, they mention it, but in all the simple (no separator) examples, they don't seem to take into account that streaming may inconsistently decide whether the whole line is the key, or just up to the first tab, should one occur. This means that some records might be sorted differently as compared to others based on whether or not there's a tab?
> Here's a very simple pair of examples, that to the naive, should produce the same output, but do not:
> > [hod] (marco) >> run dfs -fs local -cat str-tabs
> > a       1
> > b       3
> > a       4
> > 
> > [hod] (marco) >> run dfs -put str-tabs str-tabs
> > 
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output str-tabs.out -mapper /bin/cat -reducer /bin/cat
> > [blah blah blah]
> > 
> > [hod] (marco) >> run dfs -cat str-tabs.out/part-00000
> > a       4
> > a       1
> > b       3
> Compare to this negative-test:
> > [hod] (marco) >> run dfs -fs local -cat str-notabs
> > a 1
> > b 3
> > a 4
> > 
> > [hod] (marco) >> run dfs -put str-notabs str-notabs
> > 
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output str-notabs.out -mapper /bin/cat -reducer /bin/cat
> > [blah blah blah]
> > 
> > [hod] (marco) >> run dfs -cat str-notabs.out/part-00000
> > a 1
> > a 4
> > b 3
> > 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

