Message-ID: <19237437.1194995743413.JavaMail.jira@brutus>
Date: Tue, 13 Nov 2007 15:15:43 -0800 (PST)
From: "Owen O'Malley (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Issue Comment Edited: (HADOOP-1722) Make streaming to handle non-utf8 byte array
In-Reply-To: <21231009.1187274810621.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542256 ]

owen.omalley edited comment on HADOOP-1722 at 11/13/07 3:14 PM:
-----------------------------------------------------------------

I think the right way to handle this is to support a standard quoting language on input and output from each streaming process. In particular, I think that streaming should have:

tab = field separator
new line = record separator
\t = literal tab
\n = literal newline
\\ = literal backslash
all other bytes (not characters!) including non-ascii and non-utf8 are passed literally through.

Quoting is done on the stdin of the process and unquoting is done on the stdout of the process. This would make it very easy to write arbitrary binary values to the framework from streaming.

Thoughts?


      was (Author: owen.omalley):
    I think the right way to handle this is to support a standard quoting language on input and output from each streaming process. In particular, I think that streaming should have:

tab = field separator
new line = record separator
\t = literal tab
\n = literal newline
\\ = literal backquote
all other bytes (not characters!) including non-ascii and non-utf8 are passed literally through.

Quoting is done on the stdin of the process and unquoting is done on the stdout of the process. This would make it very easy to write arbitrary binary values to the framework from streaming.

Thoughts?

> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Christopher Zimmerman
>
> Right now, the streaming framework expects the outputs of the streaming process (mapper or reducer) to be line-
> oriented UTF-8 text. This limit makes it impossible to use programs whose output may be non-UTF-8
> (international encodings, or maybe even binary data). Streaming can overcome this limit by introducing a simple
> encoding protocol.
> For example, it can allow the mapper/reducer to hex-encode its keys/values, and
> the framework decodes them on the Java side.
> This way, as long as the mapper/reducer executables follow this encoding protocol,
> they can output arbitrary byte arrays and the streaming framework can handle them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
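
The quoting scheme proposed in the comment above (tab as field separator, newline as record separator, and \t, \n, \\ as escapes) could be sketched roughly as follows. This is only an illustration in Python, not code from Hadoop streaming: the function names quote, unquote, encode_record and decode_record are hypothetical, and the handling of unknown escapes or a trailing lone backslash (passed through unchanged) is an assumption the comment does not address.

```python
# Hypothetical helpers illustrating the proposed quoting; not part of Hadoop.

# Maps the byte following a backslash to the byte it stands for.
_UNESCAPE = {ord("t"): 0x09, ord("n"): 0x0A, ord("\\"): 0x5C}


def quote(field: bytes) -> bytes:
    """Escape backslash, tab and newline; leave every other byte untouched."""
    # Backslash goes first so the escapes added afterwards are not re-escaped.
    return (
        field.replace(b"\\", b"\\\\")
        .replace(b"\t", b"\\t")
        .replace(b"\n", b"\\n")
    )


def unquote(field: bytes) -> bytes:
    """Reverse quote(), scanning left to right one byte at a time."""
    out = bytearray()
    i = 0
    while i < len(field):
        b = field[i]
        if b == 0x5C and i + 1 < len(field) and field[i + 1] in _UNESCAPE:
            out.append(_UNESCAPE[field[i + 1]])
            i += 2
        else:
            # Assumption: unknown escapes and a trailing backslash pass through.
            out.append(b)
            i += 1
    return bytes(out)


def encode_record(key: bytes, value: bytes) -> bytes:
    """Build one line: quoted key, raw tab, quoted value, raw newline."""
    return quote(key) + b"\t" + quote(value) + b"\n"


def decode_record(line: bytes):
    """Split a line on its first raw tab and unquote both halves."""
    if line.endswith(b"\n"):
        line = line[:-1]
    key, _, value = line.partition(b"\t")
    return unquote(key), unquote(value)
```

Because quoted fields never contain a raw tab or newline, the first raw tab on a line is always the field separator and the only raw newline is the record terminator, so arbitrary bytes (including non-UTF-8 ones) survive a round trip.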