hadoop-common-dev mailing list archives

From arkady borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Created: (HADOOP-2806) Streaming has no way to force entire record (or null) as key
Date Mon, 11 Feb 2008 17:37:02 GMT
There are two work-arounds for this:
(a) Specify a different field separator, one that does not occur in
    your records:
     -jobconf stream.map.output.field.separator=.
    I hope it takes any character, including \0.

(b) Specify that the key spans more fields than any record has, so
    that the whole line is treated as the key:
     -jobconf stream.num.map.output.key.fields=999
    (I hope this works...)

Although both of these are work-arounds, they look no worse than the
usual way we specify Streaming options.
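
To make the splitting rule concrete, here is a minimal Python sketch of how
Streaming is documented to split a map output line into key and value. It is
a model for illustration only, not Hadoop's code: the names split_key_value
and keys_seen are made up, and the two parameters merely stand in for
stream.map.output.field.separator and stream.num.map.output.key.fields.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Model of the documented Streaming split rule (not Hadoop's actual code)."""
    if num_key_fields < 1:
        raise ValueError("num_key_fields must be at least 1")
    start = 0
    for _ in range(num_key_fields):
        pos = line.find(separator, start)
        if pos == -1:
            # Fewer separators than requested: the whole line is the key
            # and the value is empty.
            return line, ""
        start = pos + len(separator)
    # Key is everything before the Nth separator, value is everything after it.
    return line[:pos], line[start:]


def keys_seen(lines, **options):
    """Return only the keys that the sort would compare."""
    return [split_key_value(line, **options)[0] for line in lines]


with_tabs = ["a\t1", "b\t3", "a\t4"]
no_tabs = ["a 1", "b 3", "a 4"]

print(keys_seen(with_tabs))                      # default: key stops at the first tab
print(keys_seen(no_tabs))                        # no tab: each whole line is a key
print(keys_seen(with_tabs, num_key_fields=999))  # work-around (b)
print(keys_seen(with_tabs, separator="."))       # work-around (a)
```

With the defaults, both "a" records of the tabbed input share one key, and
the order of values within a key is not guaranteed, which is consistent with
"a 4" coming out before "a 1". Under either work-around each whole line
becomes its own key, as in the no-tab input. Note that (a) only helps if the
chosen separator never appears in the data.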

Hopefully, this will be better once Streaming is in Pig.

--ab



On Feb 9, 2008, at 5:04 PM, Marco Nicosia (JIRA) wrote:

> Streaming has no way to force entire record (or null) as key
> ------------------------------------------------------------
>
>                  Key: HADOOP-2806
>                  URL: https://issues.apache.org/jira/browse/HADOOP-2806
>              Project: Hadoop Core
>           Issue Type: Bug
>           Components: contrib/streaming
>             Reporter: Marco Nicosia
>             Priority: Minor
>              Fix For: 0.17.0
>
>
> I think perhaps streaming needs a "-allkey" or "-nullkey" option?  
> Otherwise, I'm concerned there is a subtle streaming documentation  
> problem.
>
> These two docs:
>
> http://hadoop.apache.org/core/docs/current/streaming.html
> http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged  
> with above?)
>
> ... seem to ignore that streaming, by default, splits key/value on  
> TAB. Sure, they mention it, but in all the simple (no separator)  
> examples, they don't seem to take into account that streaming may  
> inconsistently decide whether the whole line is the key, or just up  
> to the first tab, should one occur. This means that some records  
> might be sorted differently as compared to others based on whether  
> or not there's a tab?
>
> Here's a very simple pair of examples, that to the naive, should  
> produce the same output, but do not:
>
>> [hod] (marco) >> run dfs -fs local -cat str-tabs
>> a       1
>> b       3
>> a       4
>>
>> [hod] (marco) >> run dfs -put str-tabs str-tabs
>>
>> [hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output str-tabs.out -mapper /bin/cat -reducer /bin/cat
>> [blah blah blah]
>>
>> [hod] (marco) >> run dfs -cat str-tabs.out/part-00000
>> a       4
>> a       1
>> b       3
>
> Compare to this negative-test:
>> [hod] (marco) >> run dfs -fs local -cat str-notabs
>> a 1
>> b 3
>> a 4
>>
>> [hod] (marco) >> run dfs -put str-notabs str-notabs
>>
>> [hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output str-notabs.out -mapper /bin/cat -reducer /bin/cat
>> [blah blah blah]
>>
>> [hod] (marco) >> run dfs -cat str-notabs.out/part-00000
>> a 1
>> a 4
>> b 3
>>
>
>

