Message-ID: <11937808.1152557911165.JavaMail.jira@brutus>
Date: Mon, 10 Jul 2006 18:58:31 +0000 (GMT+00:00)
From: "Hairong Kuang (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136
In-Reply-To: <26017675.1150245149961.JavaMail.jira@brutus>

[ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420149 ]

Hairong Kuang commented on HADOOP-302:
--------------------------------------

If we use the
recordio scheme, we need to extend it so that it can read a variable-length integer from a byte array. This is needed to support byte-wise comparison.

> class Text (replacement for class UTF8) was: HADOOP-136
> -------------------------------------------------------
>
>          Key: HADOOP-302
>          URL: http://issues.apache.org/jira/browse/HADOOP-302
>      Project: Hadoop
>         Type: Improvement
>   Components: io
>     Reporter: Michel Tourn
>     Assignee: Hairong Kuang
>
> Just to verify, which length-encoding scheme are we using for class Text (aka LargeUTF8)?
> a) The "UTF-8/Lucene" scheme (the highest bit of each byte is an extension bit, which I think is what Doug is describing in his last comment), or
> b) the record-IO scheme in o.a.h.record.Utils.java:readInt?
> Either way, note that:
> 1. UTF8.java and its successor Text.java need to read the length in two ways:
> 1a. consume 1+ bytes from a DataInput, and
> 1b. parse the length within a byte array at a given offset
> (1b is used for the "WritableComparator optimized for UTF8 keys").
> o.a.h.record.Utils only supports the DataInput mode.
> It is not clear to me what the best way is to extend this Utils code when you need to support both reading modes.
> 2. Methods like UTF8's WritableComparator are meant to be low overhead; in particular, there should be no Object allocation.
> For the byte array case, the varlen-reader utility needs to be extended to return both
> the decoded length and the length of the encoded length
> (so that the caller can do offset += encodedlength).
>
> 3. A String length does not need (small) negative integers.
> 4. One advantage of a) is that it is standard (or at least well-known and natural) and there are no magic constants (like -120, -121, -124).
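
To make points 1 and 2 of the issue description concrete, here is a minimal sketch of scheme a) that supports both reading modes and allocates no objects in the byte-array case. It is an illustration only: the class VarLenSketch and its methods encode, read, decode, value and size are hypothetical names, not part of o.a.h.record.Utils, UTF8.java or Text.java. It assumes non-negative values (point 3) and well-formed input, and it returns the decoded length together with the size of its encoding packed into one long.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch of scheme a): 7 payload bits per byte, low-order
// group first, highest bit set on every byte except the last one.
final class VarLenSketch {

    /** Writes a non-negative int at buf[offset..]; returns the bytes used (1 to 5). */
    static int encode(int value, byte[] buf, int offset) {
        int start = offset;
        while ((value & ~0x7F) != 0) {
            buf[offset++] = (byte) ((value & 0x7F) | 0x80); // extension bit set
            value >>>= 7;
        }
        buf[offset++] = (byte) value; // last byte, extension bit clear
        return offset - start;
    }

    /** Mode 1a: consume 1+ bytes from a DataInput. */
    static int read(DataInput in) throws IOException {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    /**
     * Mode 1b: parse within a byte array at a given offset, without allocation.
     * The result packs the encoded size into the high 32 bits and the decoded
     * value into the low 32 bits.
     */
    static long decode(byte[] buf, int offset) {
        int value = 0;
        int shift = 0;
        int pos = offset;
        byte b;
        do {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return ((long) (pos - offset) << 32) | (value & 0xFFFFFFFFL);
    }

    /** Decoded value from a packed result. */
    static int value(long packed) {
        return (int) packed;
    }

    /** Number of bytes the encoding occupied, from a packed result. */
    static int size(long packed) {
        return (int) (packed >>> 32);
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[8];
        int offset = 3;
        int written = encode(300, buf, offset);

        // Byte-array mode: the caller advances its own offset past the length.
        long packed = decode(buf, offset);
        offset += size(packed);
        System.out.println("value=" + value(packed) + " next offset=" + offset);

        // DataInput mode over the same bytes.
        DataInput in = new DataInputStream(new ByteArrayInputStream(buf, 3, written));
        System.out.println("value=" + read(in));
    }
}
```

A comparator working on raw bytes would call decode once per key and then do offset += size(packed) before comparing the payload. Packing both results into a primitive long is just one way to avoid an Object allocation; two separate static methods (one for the value, one for the encoded size) would serve equally well at the cost of scanning the length bytes twice.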