Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <17013372.1152384751150.JavaMail.jira@brutus>
Date: Sat, 8 Jul 2006 18:52:31 +0000 (GMT+00:00)
From: "eric baldeschwieler (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136
In-Reply-To: <26017675.1150245149961.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12419872 ]

eric baldeschwieler commented on HADOOP-302:
--------------------------------------------

+1 on
Doug's suggestion. Let's use real UTF-8; then we can interoperate with more things.

Agreed that we need to use one of the existing variable-length encodings. Inventing another would be counterproductive. My preference would be to use the recordio scheme, since it is already in Hadoop. If we choose to import the Lucene version instead, we should consider using it for recordio too; that is easy to change now, since recordio is still new.

> class Text (replacement for class UTF8) was: HADOOP-136
> -------------------------------------------------------
>
>          Key: HADOOP-302
>          URL: http://issues.apache.org/jira/browse/HADOOP-302
>      Project: Hadoop
>         Type: Improvement
>   Components: io
>     Reporter: Michel Tourn
>     Assignee: Hairong Kuang
>
> Just to verify, which length-encoding scheme are we using for class Text (aka LargeUTF8)?
> a) The "UTF-8/Lucene" scheme (the highest bit of each byte is an extension bit, which I think is what Doug is describing in his last comment), or
> b) the record-IO scheme in o.a.h.record.Utils.java:readInt?
> Either way, note that:
> 1. UTF8.java and its successor Text.java need to read the length in two ways:
> 1a. consume 1+ bytes from a DataInput, and
> 1b. parse the length within a byte array at a given offset
> (1b is used for the "WritableComparator optimized for UTF8 keys").
> o.a.h.record.Utils only supports the DataInput mode.
> It is not clear to me what the best way is to extend this Utils code when you need to support both reading modes.
> 2. Methods like UTF8's WritableComparator are meant to be low overhead; in particular, there should be no Object allocation.
> For the byte array case, the varlen-reader utility needs to be extended to return both
> the decoded length and the length of the encoded length
> (so that the caller can do offset += encodedlength).
>
> 3. A String length does not need (small) negative integers.
> 4. One advantage of a) is that it is standard (or at least well-known and natural) and there are no magic constants (like -120, -121, -124).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
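
To make points 1 and 2 of the quoted issue concrete, here is a minimal sketch of scheme a), where the highest bit of each byte is an extension bit and the lower seven bits carry the value, low-order group first. It is only an illustration under assumptions: the class VarLenSketch and its method names are made up for this example and are not the code in o.a.h.record.Utils, in Lucene, or in Text.java. Returning the decoded length and the size of its encoding from two separate static calls is one assumed way to avoid Object allocation in the byte-array case.

```java
import java.io.DataInput;
import java.io.IOException;

// Hypothetical helper, for illustration only.
final class VarLenSketch {

    /** Encodes a non-negative value into buf at offset; returns the number of bytes written. */
    static int writeVInt(byte[] buf, int offset, int value) {
        int start = offset;
        while ((value & ~0x7F) != 0) {
            // Low seven bits, with the extension bit set: more bytes follow.
            buf[offset++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buf[offset++] = (byte) value; // last byte, extension bit clear
        return offset - start;
    }

    /** Mode 1a: consume one or more bytes from a DataInput. */
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Mode 1b: parse the value within a byte array at a given offset, without allocating. */
    static int readVInt(byte[] bytes, int offset) {
        byte b = bytes[offset++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[offset++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Size in bytes of the encoding of value, so a caller can advance its offset. */
    static int vIntSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }
}
```

A comparator working on raw bytes could then do `int n = VarLenSketch.readVInt(b, s); s += VarLenSketch.vIntSize(n);` to step past the length prefix. Recomputing the size costs a short loop; packing both numbers into one long return value would be another allocation-free option.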