From: "Mahadev konar (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Wed, 4 Oct 2006 09:08:26 -0700 (PDT)
Subject: [jira] Updated: (HADOOP-550) Text constructure can throw exception
Message-ID: <22374204.1159978106063.JavaMail.root@brutus>
In-Reply-To: <6747780.1158701362340.JavaMail.jira@brutus>

     [ http://issues.apache.org/jira/browse/HADOOP-550?page=all ]

Mahadev konar updated HADOOP-550:
---------------------------------

    Attachment: text_junit.patch

I have added a JUnit test and also made the set method throw a RuntimeException. The JUnit test takes a static byte array of non-UTF-8 characters, creates a Text object from it, gets the bytes back using text.getBytes(), and compares them with the original. Since we are not replacing non-UTF-8 characters, I just check that the initial and final byte arrays are the same.

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>            Reporter: Bryan Pendleton
>         Assigned To: Hairong Kuang
>             Fix For: 0.7.0
>
>         Attachments: text.patch, text_junit.patch
>
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. This time, it's better defined - constructing a Text from a string extracted from real-world data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so maybe my particular Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_ throw a content-related Exception. Or, if Text is not the right object because it's data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.
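
The kind of failure described in the quoted report can be reproduced with the JDK alone, without Hadoop. The sketch below is only an illustration: the class EncodingSketch and its two methods are hypothetical names, and whether Text configures its encoder exactly this way is an assumption. It shows that a Java String holding an unpaired surrogate is rejected by a UTF-8 encoder set to report malformed input, while String.getBytes replaces the bad character instead of throwing.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; it is not part of Hadoop.
class EncodingSketch {

    // Returns true if s encodes as UTF-8 when malformed input is
    // reported as an error rather than silently replaced.
    static boolean encodesStrictly(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            // Same exception type the report mentions.
            return false;
        }
    }

    // Lenient encoding: String.getBytes(Charset) substitutes the charset's
    // default replacement bytes for malformed input and never throws.
    static byte[] encodeLeniently(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wellFormed = "plain text";
        String loneSurrogate = "a\uD800b"; // unpaired high surrogate

        System.out.println(encodesStrictly(wellFormed));
        System.out.println(encodesStrictly(loneSurrogate));
        System.out.println(encodeLeniently(loneSurrogate).length);
    }
}
```

A container that must never throw on content would need the lenient behaviour (or some other substitution policy), at the cost of not round-tripping the original characters exactly.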