Message-ID: <1366475745.1214510025090.JavaMail.jira@brutus>
Date: Thu, 26 Jun 2008 12:53:45 -0700 (PDT)
From: "Zheng Shao (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-3341) make key-value separators in hadoop streaming fully configurable
In-Reply-To: <1444066639.1209758575610.JavaMail.jira@brutus>
Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm

     [ https://issues.apache.org/jira/browse/HADOOP-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HADOOP-3341:
-------------------------------

    Attachment: 3341-5.patch

Allow
multi-character separators in streaming map/reduce input/output.

> make key-value separators in hadoop streaming fully configurable
> ----------------------------------------------------------------
>
>                 Key: HADOOP-3341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3341
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: 3341-1.patch, 3341-2.patch, 3341-3.patch, 3341-4.patch, 3341-5.patch
>
>
> By default, hadoop streaming uses TAB as the separator in all places. However, in some environments, users may want to use customized separators (e.g., ^A = \u0001).
> The separator logic in hadoop streaming is very convoluted. Here is a brief summary:
> InputFormat {
>   KeyValueLineRecordReader.java:59:
>     S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
> }
> Mapper {
>   PipeMapper.java:88:
>     S2: clientOut_.write('\t');
>   Call mapper process
>   PipeMapRed.java:124:
>     S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
>   PipeMapRed.java:128:
>     this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
> }
> Reducer {
>   PipeReducer.java:78:
>     S4: clientOut_.write('\t');
>   Call reducer process
>   PipeMapRed.java:125:
>     S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
>   PipeMapRed.java:129:
>     this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
> }
> OutputFormat {
>   TextOutputFormat.java:112:
>     S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
> }
> Short-cuts:
> 1. In case we use "TextInputFormat", S1 and S2 are not used at all. Lines are fed directly into the mapper (through the value part of the key-value pair; the keys, which are offsets, are simply ignored).
> 2. For jobs with no reducers, the "Reducer" step is skipped.
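To make the role of S3/S5 and the stream.num.*.output.key.fields settings concrete, here is a minimal sketch of splitting one line of mapper or reducer output into key and value at the N-th occurrence of a separator that may be longer than one character. This is not the PipeMapRed code or the attached patch; the class name `KeyValueSplitSketch`, the method `split`, and the rule that a line with too few separators becomes an all-key record with an empty value are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Hadoop streaming. */
final class KeyValueSplitSketch {

    /**
     * Splits a line into (key, value) at the numKeyFields-th occurrence of separator.
     * Assumption: if the line has fewer occurrences than that, the whole line
     * is treated as the key and the value is empty.
     */
    static Map.Entry<String, String> split(String line, String separator, int numKeyFields) {
        if (separator.isEmpty() || numKeyFields < 1) {
            throw new IllegalArgumentException("need a non-empty separator and numKeyFields >= 1");
        }
        int pos = -1;   // start of the most recently found separator
        int from = 0;   // where the next search begins
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(separator, from);
            if (pos < 0) {
                return new SimpleEntry<>(line, "");
            }
            from = pos + separator.length();   // skip the full separator, not just one char
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(from));
    }

    public static void main(String[] args) {
        // Two key fields separated by ^A (code point 1), then the value.
        Map.Entry<String, String> kv = split("user42\u0001click\u00012008-06-26", "\u0001", 2);
        // Show ^A as '|' so the result is readable on a terminal.
        System.out.println(kv.getKey().replace('\u0001', '|') + " => " + kv.getValue());
        // prints "user42|click => 2008-06-26"
    }
}
```

Advancing the search position by the separator's full length is what lets the same loop handle both a single TAB and a multi-character separator such as "::".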
> We need to make S2 and S4 configurable, possibly under the following names for conformity:
>   stream.map.input.field.separator
>   stream.reduce.input.field.separator
> Then, by specifying:
>   -jobconf key.value.separator.in.input.line=^A -jobconf stream.map.input.field.separator=^A -jobconf stream.map.output.field.separator=^A -jobconf stream.reduce.input.field.separator=^A -jobconf stream.reduce.output.field.separator=^A -jobconf mapred.textoutputformat.separator=^A
> we will be able to use ^A instead of TAB in every place!
> Maybe hadoop streaming can also provide a single option to override these 6 options.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
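The "single option" idea at the end of the description could look roughly like the sketch below, which expands one separator into the six -jobconf settings. The class `SeparatorJobConfSketch` and the method `jobConfArgs` are hypothetical and do not exist in Hadoop streaming. Four of the property names are the existing ones quoted for S1, S3, S5 and S6; the two *.input.* names are only the names proposed in this issue for S2 and S4.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop streaming. */
final class SeparatorJobConfSketch {

    // S1, S3, S5 and S6 use the property names quoted in the issue.
    // The two *.input.* names (S2, S4) are the names proposed there, assumed here.
    static final String[] SEPARATOR_KEYS = {
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator",
    };

    /** Expands one separator into the six "-jobconf name=value" argument pairs. */
    static List<String> jobConfArgs(String separator) {
        List<String> args = new ArrayList<>();
        for (String key : SEPARATOR_KEYS) {
            args.add("-jobconf");
            args.add(key + "=" + separator);
        }
        return args;
    }

    public static void main(String[] args) {
        // "^A" is written here as caret + A, following the notation used in the issue.
        System.out.println(String.join(" ", jobConfArgs("^A")));
    }
}
```

Keeping the names in one array means a wrapper option would only need to touch a single place if any of the proposed names changes before the patch is committed.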