Message-ID: <32583231.1159475218147.JavaMail.jira@brutus>
Date: Thu, 28 Sep 2006 13:26:58 -0700 (PDT)
From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-550) Text constructure can throw exception
In-Reply-To: <6747780.1158701362340.JavaMail.jira@brutus>

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12438547 ]

Doug Cutting commented on HADOOP-550:
-------------------------------------

Two minor nits:

1. Instead of ignoring the CharacterCodingException that should never be thrown, it would be better to throw a RuntimeException.
That way, if it ever does happen, we'll know.

2. It would be good to add unit tests that create a Text from invalid UTF-8, i.e., random binary data, and check that the various methods work as expected.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>             Fix For: 0.7.0
>
>         Attachments: text.patch
>
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, it's better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so maybe my particular Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_ throw a content-related Exception.
> Or, if Text is not the right object because it's data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.
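
For readers following the thread, the two nits in the comment can be sketched roughly as below. This is not the code from text.patch and does not use Hadoop's Text class. The class name LenientUtf8Sketch and its encode/decode helpers are hypothetical. The sketch assumes the fix makes the JDK coder replace bad input (CodingErrorAction.REPLACE) so that a CharacterCodingException "should never be thrown", then rethrows it as a RuntimeException if it ever is. StandardCharsets needs Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Random;

/** Hypothetical stand-in for the conversion helpers a Text-like class might use. */
final class LenientUtf8Sketch {

    /** Encodes a String as UTF-8, replacing malformed input instead of failing. */
    static byte[] encode(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[bytes.remaining()];
            bytes.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Nit 1: with REPLACE this should be unreachable, so fail loudly
            // rather than silently ignoring the exception.
            throw new RuntimeException("Unexpected coding error while encoding", e);
        }
    }

    /** Decodes bytes as UTF-8, replacing malformed sequences instead of failing. */
    static String decode(byte[] utf8) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(utf8)).toString();
        } catch (CharacterCodingException e) {
            throw new RuntimeException("Unexpected coding error while decoding", e);
        }
    }

    public static void main(String[] args) {
        // Nit 2: feed random binary data through the helpers and check
        // that nothing content-related is thrown.
        Random random = new Random(550);
        for (int i = 0; i < 1000; i++) {
            byte[] junk = new byte[random.nextInt(256)];
            random.nextBytes(junk);
            String decoded = decode(junk);
            encode(decoded);
        }
        // A String with an unpaired surrogate is one example of content
        // that a strict (REPORT) encoder rejects.
        encode("a\uD800b");
        System.out.println("no content-related exceptions");
    }
}
```

Whether silently replacing bad input is the right behaviour for Text is a separate design question. The sketch only shows that "never throw on content" and "know if the impossible happens" can coexist.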