Message-ID: <5372A3A4.6080805@cloudera.com>
Date: Tue, 13 May 2014 15:58:44 -0700
From: Ryan Blue
Reply-To: dev@hbase.apache.org
User-Agent:
Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0
To: Nick Dimiduk, hbase-dev
CC: "jamestaylor@apache.org"
Subject: Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes

Here are a few more specific responses. Hopefully this clears up some
remaining points in the context of my last post.

> Why not use protobuf directly instead of reimplementing a slight
> variation of their format?

I intend to use protobuf directly for compound values. It isn't
practical right now for keys because protobuf doesn't have value
encodings that sort correctly under memcmp, nor do its tags sort
correctly under memcmp once field numbers need multi-byte tags (field
numbers above 15).

> * memcmp encodings for primitives in cells desired for phoenix (2ndary
> indices?)
>
> This sounds like a Phoenix-specific decision.

I think it's okay for the spec to optimize for certain patterns. Using
the memcmp encodings in primitive cells allows us to do value comparison
on encoded bytes and speed up scans. I was under the impression that
this is something Phoenix does to speed up results, so we included it.
If we want to optimize for something else instead, what should we
choose?

> OrderedBytes implements a bit-shifting strategy for this.
> {FixedLength,Terminated}Wrapper are provided to add flexibility. Ryan
> has suggested a variation of run-length encoding as another alternative,
> something we could add if there's sufficient need.

We went with the run-length encoding variant because in most cases it
decreases the size of the data or doesn't increase it too much. It
increases the size only when there are single null bytes, in which case
it adds a byte for each single null. Size is the same or reduced with
runs of two or more null bytes.
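To make the size accounting above concrete, here is a minimal Python
sketch. The byte layout is an assumption for illustration only (a 0x00
marker followed by a one-byte run length); it is not taken from the
spec, and it makes no attempt to preserve memcmp ordering or to
terminate the value. The names encode_nulls and decode_nulls are
hypothetical.

```python
def encode_nulls(data: bytes) -> bytes:
    """Replace each run of 0x00 bytes with a (0x00, run_length) pair.

    Assumed layout, not the spec's: runs longer than 255 are split so
    that every run length fits in a single byte.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != 0:
            # Non-null bytes are copied through unchanged.
            out.append(data[i])
            i += 1
            continue
        # Measure the run of nulls, capped at 255.
        run = 1
        while i + run < len(data) and data[i + run] == 0 and run < 255:
            run += 1
        out += bytes((0, run))
        i += run
    return bytes(out)


def decode_nulls(encoded: bytes) -> bytes:
    """Inverse of encode_nulls for well-formed input."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == 0:
            # bytes(n) yields n zero bytes; n is the stored run length.
            out += bytes(encoded[i + 1])
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    for raw in (b"a\x00b", b"a\x00\x00b", b"a\x00\x00\x00b"):
        print(len(raw), len(encode_nulls(raw)))
```

With this assumed layout, b"a\x00b" grows from 3 to 4 bytes,
b"a\x00\x00b" stays at 4, and b"a\x00\x00\x00b" shrinks from 5 to 4,
which matches the behavior described above.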
The reason for choosing the run-length variant over the OB
(OrderedBytes) type is to support null bytes, and because OB adds
ceil(size / 7) + 1 bytes to each value and requires bit shifts to encode
and decode.

> * do we include 1 byte and 2 byte ints?
>
> Following the initial commit of HBASE-8201, these were requested in
> HBASE-9369.

+1 for small ints

> The above date question is a perfect example of why I think it's
> important that we have the DataType interface. Having the interface
> means an application can implement its own types when its needs are
> too unique for commit to HBase. Other applications can still use that
> implementation by including the relevant application jars. They enjoy
> interoperability by agreeing on the DataType implementation, not on
> something provided out of the box by a particular HBase version.

I think this spec would be a stronger interop guarantee. We should
discuss whether we can support this spec along with existing data,
although I suspect we probably can't.

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.