Message-ID: <24277569.1158789623353.JavaMail.jira@brutus>
Date: Wed, 20 Sep 2006 15:00:23 -0700 (PDT)
From: "Bryan Pendleton (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-550) Text constructure can throw exception
In-Reply-To: <6747780.1158701362340.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436368 ]

Bryan Pendleton commented on HADOOP-550:
----------------------------------------

Ah, the bowels of String
handling I hadn't uncovered yet... Yes, I would support that, at least as the default. Perhaps an alternate constructor could be used that still does MalformedInput detection, if the developer prefers that. In my case, and in the case of strings passing through internal Hadoop interfaces, I'd rather not see malformed string content causing (typically) pointless errors.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, it's better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so maybe my particular Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_ throw a content-related Exception.
> Or, if Text is not the right object because it's data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.
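
For illustration, the "lenient by default, strict on request" idea in the comment can be sketched with the JDK's own CharsetEncoder. This is a rough sketch under stated assumptions, not Hadoop's Text implementation: the class LenientEncodeSketch and its encode(String, boolean) helper are hypothetical names, and the sample string with an unpaired surrogate is an assumed example of the kind of malformed content that could trigger a CharacterCodingException (the original report does not say what the offending string was).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: shows strict vs. lenient UTF-8 encoding.
public class LenientEncodeSketch {

    /**
     * Encodes a String as UTF-8.
     * strict == true  : malformed input raises CharacterCodingException (REPORT).
     * strict == false : malformed input is swapped for the encoder's replacement bytes (REPLACE).
     */
    public static ByteBuffer encode(String s, boolean strict) throws CharacterCodingException {
        CodingErrorAction action = strict ? CodingErrorAction.REPORT : CodingErrorAction.REPLACE;
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return encoder.encode(CharBuffer.wrap(s));
    }

    public static void main(String[] args) throws Exception {
        // Assumed example of bad data: a lone high surrogate with no low surrogate after it.
        String suspect = "abc\uD800def";

        // Lenient mode never fails on content; the bad char is replaced.
        ByteBuffer lenient = encode(suspect, false);
        System.out.println("lenient byte count: " + lenient.remaining());

        // Strict mode surfaces the problem to callers who want detection.
        try {
            encode(suspect, true);
            System.out.println("strict: encoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("strict: rejected with " + e.getClass().getSimpleName());
        }
    }
}
```

The trade-off is the one raised in the thread: lenient mode keeps jobs running on messy user-supplied data at the cost of silently altering the malformed characters, while strict mode preserves detection for developers who prefer it.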