Message-ID: <14941383.1155682876591.JavaMail.jira@brutus>
Date: Tue, 15 Aug 2006 16:01:16 -0700 (PDT)
From: "Sameer Paranjpye (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-439) Streaming does not work for text data if the records don't fit in a short UTF8 [2^16/3 characters]
In-Reply-To: <8502499.1155174613900.JavaMail.jira@brutus>

    [ http://issues.apache.org/jira/browse/HADOOP-439?page=comments#action_12428258 ]

Sameer Paranjpye commented on HADOOP-439:
-----------------------------------------

This ought to be resolvable by replacing UTF8 with the new Text class. Streaming should use Text instead of UTF8 to represent strings.

> Streaming does not work for text data if the records don't fit in a short UTF8 [2^16/3 characters]
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-439
>                 URL: http://issues.apache.org/jira/browse/HADOOP-439
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Dick King
>            Priority: Critical
>             Fix For: 0.6.0
>
>
> The streaming code internally reads the input data into a UTF8 object. This causes truncated data to be shipped to the mapper when the input exceeds about 21000 characters. The user gets no notice, except possibly in the logs on the individual task machines, which people would not normally read for apparently successful jobs.
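
As a rough illustration of where the 2^16/3 figure in the subject line comes from, the sketch below assumes that the encoded length of a string is stored in an unsigned 16-bit field and that a character may need up to three bytes in UTF-8. It uses only the JDK, not Hadoop's UTF8 or Text classes, and the class and method names (ShortUtf8Limit, worstCaseMaxChars, fitsInShortLength) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ShortUtf8Limit {
    // Assumption: the encoded byte length is stored in an unsigned 16-bit field.
    static final int MAX_ENCODED_BYTES = (1 << 16) - 1; // 65535

    // Worst case for characters in the Basic Multilingual Plane:
    // each one may need up to 3 bytes in UTF-8.
    static int worstCaseMaxChars() {
        return MAX_ENCODED_BYTES / 3;
    }

    // True if the UTF-8 encoding of s fits within the 16-bit length field.
    static boolean fitsInShortLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length <= MAX_ENCODED_BYTES;
    }

    // Helper: build a string of n copies of c.
    static String repeat(char c, int n) {
        char[] buf = new char[n];
        Arrays.fill(buf, c);
        return new String(buf);
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) encodes to 3 bytes in UTF-8.
        char threeByteChar = '\u20AC';
        int limit = worstCaseMaxChars();

        System.out.println("max encoded bytes:    " + MAX_ENCODED_BYTES);
        System.out.println("worst-case max chars: " + limit);
        System.out.println("record at limit fits: "
                + fitsInShortLength(repeat(threeByteChar, limit)));
        System.out.println("one char over fits:   "
                + fitsInShortLength(repeat(threeByteChar, limit + 1)));
    }
}
```

Under these assumptions the worst-case limit works out to 21845 characters, which is in line with the "about 21000 characters" in the report. A record of plain ASCII text would fit up to 65535 characters under the same byte budget, so the character count at which data is lost depends on how the limit is enforced.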