hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-1722) Make streaming to handle non-utf8 byte array
Date Tue, 13 Nov 2007 23:15:43 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542256 ]

owen.omalley edited comment on HADOOP-1722 at 11/13/07 3:14 PM:
-----------------------------------------------------------------

I think the right way to handle this is to support a standard quoting language on input and
output from each streaming process. In particular, I think that streaming should have:

tab = field separator
new line = record separator
\t = literal tab
\n = literal newline
\\ = literal backslash

All other bytes (not characters!), including non-ASCII and non-UTF-8 bytes, are passed through literally.
Quoting is done on the stdin of the process and unquoting is done on the stdout of the process.
This would make it very easy to write arbitrary binary values to the framework from streaming.
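
As a rough illustration of the quoting rules listed above, here is a minimal Python sketch that works on raw bytes. The function names (quote, unquote, encode_record, decode_record) are hypothetical and not part of Hadoop streaming, and passing an unrecognized backslash sequence through unchanged is an assumption, since the proposal does not say how to treat it.

```python
# Hypothetical helpers sketching the proposed quoting; not an existing Hadoop API.

# Escape letter -> literal byte, for the three escapes in the proposal.
_UNESCAPE = {ord("t"): 0x09, ord("n"): 0x0A, ord("\\"): 0x5C}


def quote(data: bytes) -> bytes:
    """Escape backslash, tab and newline; every other byte passes through."""
    # Backslash goes first so the escapes added afterwards are not re-escaped.
    return (data.replace(b"\\", b"\\\\")
                .replace(b"\t", b"\\t")
                .replace(b"\n", b"\\n"))


def unquote(data: bytes) -> bytes:
    """Reverse quote(); unknown escapes are kept as-is (an assumption)."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x5C and i + 1 < len(data) and data[i + 1] in _UNESCAPE:
            out.append(_UNESCAPE[data[i + 1]])
            i += 2
            continue
        out.append(byte)
        i += 1
    return bytes(out)


def encode_record(key: bytes, value: bytes) -> bytes:
    """One record: quoted key, tab, quoted value, newline."""
    return quote(key) + b"\t" + quote(value) + b"\n"


def decode_record(line: bytes) -> tuple:
    """Split on the only unescaped tab, then unquote both fields."""
    body = line[:-1] if line.endswith(b"\n") else line
    key, _, value = body.partition(b"\t")
    return unquote(key), unquote(value)
```

Because quoted fields never contain a raw tab or newline byte, the record can be split on those two bytes without looking at the rest of the data, which is what lets arbitrary binary values through.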

Thoughts?

      was (Author: owen.omalley):
    I think the right way to handle this is to support a standard quoting language on input
and output from each streaming process. In particular, I think that streaming should have:

tab = field separator
new line = record separator
\t = literal tab
\n = literal newline
\\ = literal backquote

all other bytes (not characters!) including non-ascii and non-utf8 are passed literally through.
Quoting is done on the stdin of the process and unquoting is done on the stdout of the process.
This would make it very easy to write arbitrary binary values to the framework from streaming.

Thoughts?
  
> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Christopher Zimmerman
>
> Right now, the streaming framework expects the outputs of the streaming process (mapper or
> reducer) to be line-oriented UTF-8 text. This limitation makes it impossible to use programs
> whose outputs may be non-UTF-8 (international encodings, or maybe even binary data). Streaming
> can overcome this limitation by introducing a simple encoding protocol. For example, it can
> allow the mapper/reducer to hex-encode its keys/values, and the framework decodes them on
> the Java side.
> This way, as long as the mapper/reducer executables follow this encoding protocol,
> they can output arbitrary byte arrays and the streaming framework can handle them.
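
For comparison, the hex-encoding idea from the issue description could look roughly like the Python sketch below. The function names are hypothetical, and the decode step is shown in Python only for illustration; in the description it would happen on the Java side of the framework.

```python
# Hypothetical sketch of a hex-encoding protocol for streaming records.

def encode_hex_record(key: bytes, value: bytes) -> bytes:
    """Mapper/reducer side: emit hex-encoded key and value as one text line."""
    return key.hex().encode("ascii") + b"\t" + value.hex().encode("ascii") + b"\n"


def decode_hex_record(line: bytes) -> tuple:
    """Framework side: recover the original byte arrays from the hex text."""
    body = line[:-1] if line.endswith(b"\n") else line
    key_hex, _, value_hex = body.partition(b"\t")
    return (bytes.fromhex(key_hex.decode("ascii")),
            bytes.fromhex(value_hex.decode("ascii")))
```

Hex output contains only the characters 0-9 and a-f, so it can never collide with the tab or newline separators, at the cost of doubling the size of every key and value.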

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

