References: <515A18CB.3050601@salesforce.com>
Date: Mon, 1 Apr 2013 23:17:16 -0700
Subject: Re: HBase Types: Explicit Null Support
From: Matt Corgan
To: dev
Cc: hbase-user

Ah, I didn't even realize SQL allowed null key parts. Maybe a goal of the
interfaces should be to provide first-class support for custom user types
in addition to the standard ones included. Part of the power of HBase's
plain byte[] keys is that users can concoct the perfect key for their data
type. For example, I have a lot of geographic data where I interleave
latitude/longitude bits into a sortable 64-bit value that would probably
never be included in a standard library.


On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar wrote:

> I think having Int32, and NullableInt32 would support minimum overhead,
> as well as allowing SQL semantics.
>
>
> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk wrote:
>
> > Furthermore, is it more important to support null values than squeeze
> > all representations into minimum size (4-bytes for int32, &c.)?
> > On Apr 1, 2013 4:41 PM, "Nick Dimiduk" wrote:
> >
> > > On Mon, Apr 1, 2013 at 4:31 PM, James Taylor wrote:
> > >
> > >> From the SQL perspective, handling null is important.
> > >
> > >
> > > From your perspective, it is critical to support NULLs, even at the
> > > expense of fixed-width encodings at all or supporting representation
> > > of a full range of values. That is, you'd rather be able to represent
> > > NULL than -2^31?
> > >
> > > On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > >>
> > >>> Thanks for the thoughtful response (and code!).
> > >>>
> > >>> I'm thinking I will press forward with a base implementation that
> > >>> does not support nulls. The idea is to provide an extensible set
> > >>> of interfaces, so I think this will not box us into a corner
> > >>> later. That is, a mirroring package could be implemented that
> > >>> supports null values and accepts the relevant trade-offs.
> > >>>
> > >>> Thanks,
> > >>> Nick
> > >>>
> > >>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan wrote:
> > >>>
> > >>>> I spent some time this weekend extracting bits of our
> > >>>> serialization code to a public GitHub repo at
> > >>>> <http://github.com/hotpads/data-tools>.
> > >>>> Contributions are welcome - I'm sure we all have this stuff lying
> > >>>> around.
> > >>>>
> > >>>> You can see I've bumped into the NULL problem in a few places:
> > >>>> *
> > >>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > >>>> *
> > >>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > >>>>
> > >>>> Looking back, I think my latest opinion on the topic is to reject
> > >>>> nullability as the rule since it can cause unexpected behavior
> > >>>> and confusion. It's cleaner to provide a wrapper class (so both
> > >>>> LongArrayList plus NullableLongArrayList) that explicitly defines
> > >>>> the behavior, and costs a little more in performance. If the user
> > >>>> can't find a pre-made wrapper class, it's not very difficult for
> > >>>> each user to provide their own interpretation of null and check
> > >>>> for it themselves.
> > >>>>
> > >>>> If you reject nullability, the question becomes what to do in
> > >>>> situations where you're implementing existing interfaces that
> > >>>> accept nullable params. The LongArrayList above implements
> > >>>> List<Long>, which requires an add(Long) method. In the above
> > >>>> implementation I chose to swap nulls with Long.MIN_VALUE; however,
> > >>>> I'm now thinking it best to force the user to make that swap and
> > >>>> then throw IllegalArgumentException if they pass null.
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil
> > >>>> <doug.meil@explorysmedical.com> wrote:
> > >>>>> Hmmm... good question.
> > >>>>>
> > >>>>> I think that fixed width support is important for a great many
> > >>>>> rowkey constructs cases, so I'd rather see something like losing
> > >>>>> MIN_VALUE and keeping fixed width.
> > >>>>>
> > >>>>>
> > >>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" wrote:
> > >>>>>
> > >>>>>> Heya,
> > >>>>>>
> > >>>>>> Thinking about data types and serialization. I think null
> > >>>>>> support is an important characteristic for the serialized
> > >>>>>> representations, especially when considering the compound
> > >>>>>> type. However, doing so is directly incompatible with
> > >>>>>> fixed-width representations for numerics. For instance, if we
> > >>>>>> want to have a fixed-width signed long stored on 8 bytes,
> > >>>>>> where do you put null? float and double types can cheat a
> > >>>>>> little by folding negative and positive NaNs into a single
> > >>>>>> representation (this isn't strictly correct!), leaving a place
> > >>>>>> to represent null. In the long example case, the obvious
> > >>>>>> choice is to reduce MAX_VALUE or increase MIN_VALUE by one.
> > >>>>>> This will allocate an additional encoding which can be used
> > >>>>>> for null. My experience working with scientific data, however,
> > >>>>>> makes me wince at the idea.
> > >>>>>>
> > >>>>>> The variable-width encodings have it a little easier. There's
> > >>>>>> already enough going on that it's simpler to make room.
> > >>>>>>
> > >>>>>> Remember, the final goal is to support order-preserving
> > >>>>>> serialization. This imposes some limitations on our encoding
> > >>>>>> strategies. For instance, it's not enough to simply encode
> > >>>>>> null, it really needs to be encoded as 0x00 so as to sort
> > >>>>>> lexicographically earlier than any other value.
> > >>>>>>
> > >>>>>> What do you think? Any ideas, experiences, etc?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Nick
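
The Int32 / NullableInt32 split suggested in the thread can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code from HBase or from the data-tools repo: the class name OrderedInt32, its methods, and the 0x00/0x01 marker bytes are all hypothetical. The fixed-width form flips the sign bit so that unsigned lexicographic byte comparison matches signed integer order. The nullable form spends one extra leading byte, so null (0x00) sorts before every non-null value and the full int range, including -2^31, stays representable.

```java
import java.util.Arrays;

// Hypothetical sketch: names and marker bytes are assumptions, not an HBase API.
final class OrderedInt32 {
    private static final byte NULL_MARKER = 0x00;     // sorts before any non-null value
    private static final byte NOT_NULL_MARKER = 0x01;

    /** Fixed-width, non-nullable: 4 big-endian bytes with the sign bit flipped. */
    static byte[] encode(int value) {
        int flipped = value ^ Integer.MIN_VALUE; // maps MIN_VALUE..MAX_VALUE to 0x00000000..0xFFFFFFFF
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8), (byte) flipped
        };
    }

    /** Nullable: one marker byte, followed by the 4-byte encoding when non-null. */
    static byte[] encodeNullable(Integer value) {
        if (value == null) {
            return new byte[] { NULL_MARKER };
        }
        byte[] body = encode(value);
        byte[] out = new byte[1 + body.length];
        out[0] = NOT_NULL_MARKER;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    /** Inverse of encodeNullable. */
    static Integer decodeNullable(byte[] bytes) {
        if (bytes[0] == NULL_MARKER) {
            return null;
        }
        int flipped = ((bytes[1] & 0xFF) << 24) | ((bytes[2] & 0xFF) << 16)
                    | ((bytes[3] & 0xFF) << 8) | (bytes[4] & 0xFF);
        return flipped ^ Integer.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the order in which byte[] row keys sort. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(-5)));
        System.out.println(Arrays.toString(encodeNullable(null)));
        System.out.println(Arrays.toString(encodeNullable(Integer.MIN_VALUE)));
    }
}
```

The trade-off is the one debated above: the nullable form costs a byte per value, while the fixed-width form stays at 4 bytes and simply has nowhere to put null.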