From: "Abdul Qadeer (JIRA)"
To: core-dev@hadoop.apache.org
Date: Tue, 23 Sep 2008 16:37:47 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-4182) Streaming Documentation Update

    [ https://issues.apache.org/jira/browse/HADOOP-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633973#action_12633973 ]

Abdul Qadeer commented on HADOOP-4182:
--------------------------------------

I agree with you that it is a problem at the application / user level. I only wanted to put a simple note somewhere on the Hadoop Wiki saying that every line must end with an end-of-line delimiter. If it does not, the user might get different behavior, as I explained earlier. This simple note can keep users from accidental, unexpected results.

> Streaming Documentation Update
> ------------------------------
>
>                 Key: HADOOP-4182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4182
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>    Affects Versions: 0.19.0
>            Reporter: Abdul Qadeer
>            Priority: Minor
>             Fix For: 0.19.0
>
>
> When Text input data is used with streaming, every line is expected to end with a newline. Hadoop results are undefined if input files do not end in a newline. (The results will depend on how files are assigned to mappers.)
> Example:
> In streaming, if
> mapper = xargs cat
> reducer = cat
> and the input is a two-line file, where each line is a symbolic link in HDFS
> link1\n
> link2\n
> EOF
> link1 points to a file which contains
> This is line1EOF
> link2 points to a file which contains
> This is line2EOF
> Now running a streaming job such that there is only one split will produce the result:
> This is line1This is line2\t\n
> But if there were two splits, the result will be
> This is line1\t\n
> This is line2\t\n
> So in summary, the output depends on how many mappers were invoked. As a caution, the Streaming wiki should record that users must always put a newline at the end of each line to avoid such problems.
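
To make the split dependence concrete, here is a minimal Python sketch of the scenario described in the issue. It does not use Hadoop at all: run_mapper, run_job and ensure_trailing_newline are hypothetical helpers written only for this illustration, the contiguous assignment of files to mappers is a simplifying assumption about how splits work, and the trailing "\t" that streaming adds for an empty value is left out.

```python
def run_mapper(file_contents):
    """Mimic `xargs cat`: concatenate the files' bytes, then break the
    result into lines the way a line-oriented reader would."""
    stdout = b"".join(file_contents)
    return stdout.splitlines()


def run_job(file_contents, num_splits):
    """Hand files to mappers in contiguous chunks (a simplified stand-in
    for input splits) and collect every line the mappers emit."""
    per_split = -(-len(file_contents) // num_splits)  # ceiling division
    lines = []
    for start in range(0, len(file_contents), per_split):
        lines.extend(run_mapper(file_contents[start:start + per_split]))
    return lines


def ensure_trailing_newline(data):
    """Append a newline unless the data is empty or already ends with one."""
    return data if not data or data.endswith(b"\n") else data + b"\n"


# The two files from the issue, neither ending in a newline.
files = [b"This is line1", b"This is line2"]

one_split = run_job(files, 1)    # both files go to a single mapper
two_splits = run_job(files, 2)   # one file per mapper

# Same data after terminating every file with a newline.
fixed = [ensure_trailing_newline(f) for f in files]
fixed_one_split = run_job(fixed, 1)

print(one_split)
print(two_splits)
print(fixed_one_split)
```

With the unterminated files, the single-mapper run yields one merged line while the two-mapper run yields two lines. Once each file ends in a newline, the single-mapper run also yields two lines, which is the reason for the proposed wiki note.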