Subject: Re: HBase Types: Explicit Null Support
From: Michel Segel
Date: Mon, 1 Apr 2013 22:40:03 -0400
To: "user@hbase.apache.org"
CC: hbase-dev, hbase-user

Silly question...

Null support. In a system where a column may or may not exist, how do you
support null? ;-)

In terms of a key, it's a primary key and can't be null.

So what am I missing?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 1, 2013, at 10:26 PM, Nick Dimiduk wrote:

> Furthermore, is it more important to support null values than to squeeze all
> representations into minimum size (4 bytes for int32, &c.)?
> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" wrote:
>
>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor wrote:
>>
>>> From the SQL perspective, handling null is important.
>>
>>
>> From your perspective, it is critical to support NULLs, even at the
>> expense of having fixed-width encodings at all, or of representing the
>> full range of values. That is, you'd rather be able to represent NULL than
>> -2^31?
>>
>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>
>>>> Thanks for the thoughtful response (and code!).
>>>>
>>>> I'm thinking I will press forward with a base implementation that does
>>>> not support nulls. The idea is to provide an extensible set of
>>>> interfaces, so I think this will not box us into a corner later. That
>>>> is, a mirroring package could be implemented that supports null values
>>>> and accepts the relevant trade-offs.
>>>>
>>>> Thanks,
>>>> Nick
>>>>
>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan wrote:
>>>>
>>>>> I spent some time this weekend extracting bits of our serialization
>>>>> code to a public GitHub repo at http://github.com/hotpads/data-tools.
>>>>> Contributions are welcome - I'm sure we all have this stuff lying
>>>>> around.
>>>>>
>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>
>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>> nullability as the rule, since it can cause unexpected behavior and
>>>>> confusion. It's cleaner to provide a wrapper class (so both
>>>>> LongArrayList and NullableLongArrayList) that explicitly defines the
>>>>> behavior and costs a little more in performance. If the user can't
>>>>> find a pre-made wrapper class, it's not very difficult for each user
>>>>> to provide their own interpretation of null and check for it
>>>>> themselves.
>>>>>
>>>>> If you reject nullability, the question becomes what to do in
>>>>> situations where you're implementing existing interfaces that accept
>>>>> nullable params. The LongArrayList above implements List<Long>, which
>>>>> requires an add(Long) method. In the above implementation I chose to
>>>>> swap nulls with Long.MIN_VALUE; however, I'm now thinking it best to
>>>>> force the user to make that swap, and to throw
>>>>> IllegalArgumentException if they pass null.
>>>>>
>>>>>
>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>>>
>>>>>> Hmmm... good question.
>>>>>>
>>>>>> I think that fixed-width support is important for a great many rowkey
>>>>>> constructs, so I'd rather see something like losing MIN_VALUE and
>>>>>> keeping fixed width.
>>>>>>
>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" wrote:
>>>>>>
>>>>>>> Heya,
>>>>>>>
>>>>>>> Thinking about data types and serialization.
>>>>>>> I think null support is an important characteristic for the
>>>>>>> serialized representations, especially when considering the compound
>>>>>>> type. However, doing so is directly incompatible with fixed-width
>>>>>>> representations for numerics. For instance, if we want to have a
>>>>>>> fixed-width signed long stored on 8 bytes, where do you put null?
>>>>>>> float and double types can cheat a little by folding negative and
>>>>>>> positive NaNs into a single representation (this isn't strictly
>>>>>>> correct!), leaving a place to represent null. In the long example
>>>>>>> case, the obvious choice is to reduce MAX_VALUE or increase MIN_VALUE
>>>>>>> by one. This frees up one additional encoding, which can be used for
>>>>>>> null. My experience working with scientific data, however, makes me
>>>>>>> wince at the idea.
>>>>>>>
>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>> enough going on that it's simpler to make room.
>>>>>>>
>>>>>>> Remember, the final goal is to support order-preserving
>>>>>>> serialization. This imposes some limitations on our encoding
>>>>>>> strategies. For instance, it's not enough to simply encode null; it
>>>>>>> really needs to be encoded as 0x00 so as to sort lexicographically
>>>>>>> earlier than any other value.
>>>>>>>
>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
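
For reference, the "lose MIN_VALUE, keep fixed width" option discussed in the thread could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from HBase or from the data-tools repo: the class name NullableLongCodec and its methods are hypothetical. It assumes a big-endian 8-byte encoding with the sign bit flipped, so that unsigned byte-wise comparison matches signed numeric order. Under that scheme Long.MIN_VALUE would map to eight 0x00 bytes, so that slot is reserved for null and MIN_VALUE itself is rejected.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: fixed-width, order-preserving encoding of a nullable long.
final class NullableLongCodec {
    static final int WIDTH = 8;

    // Encode to 8 big-endian bytes. null becomes all zeros, which sorts first.
    static byte[] encode(Long value) {
        if (value == null) {
            return new byte[WIDTH];
        }
        if (value == Long.MIN_VALUE) {
            // MIN_VALUE with the sign bit flipped is 0, the slot reserved for null.
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        // Flipping the sign bit makes unsigned byte order agree with signed order.
        return ByteBuffer.allocate(WIDTH).putLong(value ^ Long.MIN_VALUE).array();
    }

    // Decode 8 bytes produced by encode(); all zeros decodes to null.
    static Long decode(byte[] bytes) {
        long raw = ByteBuffer.wrap(bytes).getLong();
        return raw == 0L ? null : Long.valueOf(raw ^ Long.MIN_VALUE);
    }

    // Lexicographic comparison treating each byte as unsigned, as a row key sort would.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        Long[] values = {42L, null, -7L, Long.MAX_VALUE, Long.MIN_VALUE + 1};
        byte[][] keys = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encode(values[i]);
        }
        // Sort by raw bytes only, then decode to show the resulting order.
        Arrays.sort(keys, NullableLongCodec::compareBytes);
        for (byte[] key : keys) {
            System.out.println(decode(key));
        }
    }
}
```

Sorting the encoded keys purely by their bytes puts null first, followed by the remaining values in ascending numeric order. The cost is the one Nick winces at: one legitimate value (Long.MIN_VALUE) can no longer be stored.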