Date: Wed, 3 Apr 2013 11:29:19 -0700
Subject: Re: HBase Types: Explicit Null Support
From: Dmitriy Ryaboy
To: user@hbase.apache.org
Cc: dev@hbase.apache.org, pig-dev, hive-dev

Hiya Nick,

Pig converts data for HBase storage using this class:
https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java
(which is mostly just calling into HBase's Bytes class). As long as Bytes
handles the null stuff, we'll just inherit the behavior.

On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk wrote:

> I agree that a user-extensible interface is a required feature here.
> Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> keep in mind, though, that SQL and user applications are not the only
> consumers of this interface. A big motivation is allowing interop with the
> other higher MR languages. *cough* Where are my Pig and Hive peeps in this
> thread?
>
> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor wrote:
>
> > Maybe if we can keep nullability separate from the
> > serialization/deserialization, we can come up with a solution that works?
> > We're able to essentially infer that a column is null based on its value
> > being missing or empty. So if an iterator through the row key bytes could
> > detect/indicate that, then an application could "infer" the value is null.
> >
> > We're definitely planning on keeping byte[] accessors for use cases that
> > need it. I'm curious on the geographic data case, though, could you use a
> > fixed length long with a couple of new SQL built-ins to encode/decode the
> > latitude/longitude?
> >
> > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> >
> >> Actually, that isn't all that far-fetched of a format Matt - pretty common
> >> anytime anyone wants to do sortable lat/long (*cough* three letter
> >> agencies *cough*).
> >>
> >> Wouldn't we get the same by providing a simple set of libraries (ala
> >> orderly + other HBase useful things) and then still give access to the
> >> underlying byte array? Perhaps a nullable key type in that lib makes sense
> >> if lots of people need it and it would be nice to have standard libraries
> >> so tools could interop much more easily.
> >> -------------------
> >> Jesse Yates
> >> @jesse_yates
> >> jyates.github.com
> >>
> >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan wrote:
> >>
> >>> Ah, I didn't even realize sql allowed null key parts. Maybe a goal of the
> >>> interfaces should be to provide first-class support for custom user types
> >>> in addition to the standard ones included. Part of the power of hbase's
> >>> plain byte[] keys is that users can concoct the perfect key for their data
> >>> type. For example, I have a lot of geographic data where I interleave
> >>> latitude/longitude bits into a sortable 64 bit value that would probably
> >>> never be included in a standard library.
> >>>
> >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar wrote:
> >>>
> >>>> I think having Int32, and NullableInt32 would support minimum overhead,
> >>>> as well as allowing SQL semantics.
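
For reference, the interleaved latitude/longitude key Matt mentions above could look roughly like the sketch below. This is not the hotpads code: the class name ZOrderKey, the 32-bit quantization of each coordinate, and the choice to put latitude in the odd bit positions are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not taken from hotpads/data-tools or HBase.
final class ZOrderKey {

    // Scale a coordinate in [min, max] onto the unsigned 32-bit range 0..2^32-1.
    static long quantize(double value, double min, double max) {
        if (!(value >= min && value <= max)) { // the negated test also rejects NaN
            throw new IllegalArgumentException("out of range: " + value);
        }
        return (long) ((value - min) / (max - min) * 0xFFFFFFFFL);
    }

    // Spread the low 32 bits of x so that input bit i lands on output bit 2*i.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    // Interleave the two quantized coordinates: latitude on odd bits, longitude on even bits.
    static long interleave(double lat, double lon) {
        long latBits = quantize(lat, -90.0, 90.0);
        long lonBits = quantize(lon, -180.0, 180.0);
        return (spread(latBits) << 1) | spread(lonBits);
    }

    // Big-endian bytes, so an unsigned byte-wise comparison matches the unsigned 64-bit order.
    static byte[] toKey(long z) {
        return ByteBuffer.allocate(8).putLong(z).array();
    }

    public static void main(String[] args) {
        long z = interleave(37.7749, -122.4194);
        System.out.println(Long.toHexString(z));
    }
}
```

The key stays a fixed 8 bytes, and points that are close together tend to share long key prefixes. There is no spare bit pattern left for null, which is the trade-off this thread is about.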
> >>>>
> >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk wrote:
> >>>>
> >>>>> Furthermore, is it more important to support null values than squeeze all
> >>>>> representations into minimum size (4-bytes for int32, &c.)?
> >>>>>
> >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" wrote:
> >>>>>
> >>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com> wrote:
> >>>>>>
> >>>>>>> From the SQL perspective, handling null is important.
> >>>>>>
> >>>>>> From your perspective, it is critical to support NULLs, even at the
> >>>>>> expense of fixed-width encodings at all or supporting representation of a
> >>>>>> full range of values. That is, you'd rather be able to represent NULL than
> >>>>>> -2^31?
> >>>>>>
> >>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> >>>>>>>
> >>>>>>>> Thanks for the thoughtful response (and code!).
> >>>>>>>>
> >>>>>>>> I'm thinking I will press forward with a base implementation that does not
> >>>>>>>> support nulls. The idea is to provide an extensible set of interfaces, so I
> >>>>>>>> think this will not box us into a corner later. That is, a mirroring
> >>>>>>>> package could be implemented that supports null values and accepts
> >>>>>>>> the relevant trade-offs.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Nick
> >>>>>>>>
> >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan wrote:
> >>>>>>>>
> >>>>>>>>> I spent some time this weekend extracting bits of our serialization code to
> >>>>>>>>> a public github repo at <http://github.com/hotpads/data-tools>.
> >>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying around.
> >>>>>>>>>
> >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> >>>>>>>>>
> >>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
> >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> >>>>>>>>> confusion. It's cleaner to provide a wrapper class (so both LongArrayList
> >>>>>>>>> plus NullableLongArrayList) that explicitly defines the behavior, and costs
> >>>>>>>>> a little more in performance. If the user can't find a pre-made wrapper
> >>>>>>>>> class, it's not very difficult for each user to provide their own
> >>>>>>>>> interpretation of null and check for it themselves.
> >>>>>>>>>
> >>>>>>>>> If you reject nullability, the question becomes what to do in situations
> >>>>>>>>> where you're implementing existing interfaces that accept nullable params.
> >>>>>>>>> The LongArrayList above implements List<Long> which requires an add(Long)
> >>>>>>>>> method. In the above implementation I chose to swap nulls with
> >>>>>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to make
> >>>>>>>>> that swap and then throw IllegalArgumentException if they pass null.
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hmmm... good question.
> >>>>>>>>>>
> >>>>>>>>>> I think that fixed width support is important for a great many rowkey
> >>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
> >>>>>>>>>> keeping fixed width.
> >>>>>>>>>>
> >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Heya,
> >>>>>>>>>>>
> >>>>>>>>>>> Thinking about data types and serialization. I think null support is an
> >>>>>>>>>>> important characteristic for the serialized representations, especially
> >>>>>>>>>>> when considering the compound type. However, doing so is directly
> >>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
> >>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
> >>>>>>>>>>> you put null?
> >>>>>>>>>>> float and double types can cheat a little by folding negative
> >>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
> >>>>>>>>>>> correct!), leaving a place to represent null. In the long example case,
> >>>>>>>>>>> the obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one.
> >>>>>>>>>>> This will allocate an additional encoding which can be used for null. My
> >>>>>>>>>>> experience working with scientific data, however, makes me wince at the
> >>>>>>>>>>> idea.
> >>>>>>>>>>>
> >>>>>>>>>>> The variable-width encodings have it a little easier. There's already
> >>>>>>>>>>> enough going on that it's simpler to make room.
> >>>>>>>>>>>
> >>>>>>>>>>> Remember, the final goal is to support order-preserving serialization.
> >>>>>>>>>>> This imposes some limitations on our encoding strategies. For instance,
> >>>>>>>>>>> it's not enough to simply encode null, it really needs to be encoded as
> >>>>>>>>>>> 0x00 so as to sort lexicographically earlier than any other value.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Nick
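
As a rough illustration of the fixed-width option discussed in this thread (giving up Long.MIN_VALUE so that null can take the all-0x00 encoding and sort first), here is a minimal sketch. NullableLongCodec and its methods are hypothetical names, not part of HBase's Bytes class or of any API proposed here, and the sign-bit flip is just one common way to get order-preserving bytes.

```java
import java.nio.ByteBuffer;

// Hypothetical codec for illustration only; not an HBase or Pig API.
final class NullableLongCodec {
    static final int WIDTH = 8;

    // Encode a nullable long into exactly 8 bytes whose unsigned byte order
    // matches the numeric order, with null sorting before every value.
    static byte[] encode(Long value) {
        if (value == null) {
            return new byte[WIDTH]; // all 0x00: the slot Long.MIN_VALUE would have used
        }
        if (value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        // Flipping the sign bit maps signed order onto unsigned big-endian byte order.
        return ByteBuffer.allocate(WIDTH).putLong(value ^ Long.MIN_VALUE).array();
    }

    static Long decode(byte[] bytes) {
        if (bytes.length != WIDTH) {
            throw new IllegalArgumentException("expected " + WIDTH + " bytes");
        }
        long raw = ByteBuffer.wrap(bytes).getLong();
        if (raw == 0L) {
            return null; // the reserved all-zero encoding
        }
        return raw ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison of byte arrays, the order row keys are assumed to sort in.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(null)));
        System.out.println(decode(encode(-7L)));
    }
}
```

The cost is exactly the one Nick winces at: one representable value is lost. A NullableInt32-style wrapper with a leading marker byte would keep the full range but give up the minimum width.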