hbase-dev mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: HBase Types: Explicit Null Support
Date Tue, 02 Apr 2013 06:33:43 GMT
Maybe if we can keep nullability separate from the 
serialization/deserialization, we can come up with a solution that 
works? We're able to essentially infer that a column is null based on 
its value being missing or empty. So if an iterator through the row key 
bytes could detect/indicate that, then an application could "infer" the 
value is null.
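
A minimal sketch of that idea, assuming a hypothetical row key layout in which variable-length components are terminated by a 0x00 separator byte and never contain that byte themselves. RowKeyParts and split are illustrative names, not part of HBase or Phoenix. Decoding the component bytes into typed values is left to the caller, so nullability stays separate from serialization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not an HBase or Phoenix API.
final class RowKeyParts {
    // Assumed component terminator; real key layouts may differ.
    static final byte SEPARATOR = 0x00;

    // Splits a row key on SEPARATOR. A component that is empty, or that is
    // missing from the end of the key, is reported as null.
    static List<byte[]> split(byte[] rowKey, int expectedParts) {
        List<byte[]> parts = new ArrayList<>(expectedParts);
        int start = 0;
        for (int i = 0; i <= rowKey.length && parts.size() < expectedParts; i++) {
            if (i == rowKey.length || rowKey[i] == SEPARATOR) {
                // Zero bytes between two boundaries: infer null.
                parts.add(i == start ? null : Arrays.copyOfRange(rowKey, start, i));
                start = i + 1;
            }
        }
        // Trailing components absent from the key are inferred to be null.
        while (parts.size() < expectedParts) {
            parts.add(null);
        }
        return parts;
    }
}
```

With a four-part key whose bytes are 'a', 0x00, 0x00, 'c', split returns the bytes for "a", then null, then the bytes for "c", then null for the missing trailing part.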

We're definitely planning on keeping byte[] accessors for use cases that 
need them. I'm curious about the geographic data case, though: could you 
use a fixed-length long with a couple of new SQL built-ins to 
encode/decode the latitude/longitude?
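
As a rough sketch of what such encode/decode built-ins might compute, assuming each coordinate is quantized to an unsigned 32-bit integer and the two are bit-interleaved (a Z-order or Morton code) into one 64-bit value. GeoKey and its method names are hypothetical, and this is not Matt's actual implementation. Interleaving keeps nearby points close together in key order; it does not sort by latitude alone.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not part of HBase, Phoenix, or data-tools.
final class GeoKey {
    private static final double MAX_UINT32 = 4294967295.0; // 2^32 - 1

    // Maps value in [min, max] onto 0 .. 2^32-1, preserving order.
    static long quantize(double value, double min, double max) {
        return (long) ((value - min) / (max - min) * MAX_UINT32);
    }

    // Approximate inverse of quantize (loses precision below one step).
    static double dequantize(long q, double min, double max) {
        return min + (q / MAX_UINT32) * (max - min);
    }

    // Spreads the low 32 bits of x onto the even bit positions of a long.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    // Inverse of spread: gathers the even bit positions into the low 32 bits.
    static long compact(long x) {
        x &= 0x5555555555555555L;
        x = (x | (x >>> 1)) & 0x3333333333333333L;
        x = (x | (x >>> 2)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x >>> 4)) & 0x00FF00FF00FF00FFL;
        x = (x | (x >>> 8)) & 0x0000FFFF0000FFFFL;
        x = (x | (x >>> 16)) & 0x00000000FFFFFFFFL;
        return x;
    }

    // Latitude bits take the odd positions, longitude bits the even ones.
    static long encode(double lat, double lon) {
        return (spread(quantize(lat, -90.0, 90.0)) << 1)
                | spread(quantize(lon, -180.0, 180.0));
    }

    static double decodeLatitude(long key) {
        return dequantize(compact(key >>> 1), -90.0, 90.0);
    }

    static double decodeLongitude(long key) {
        return dequantize(compact(key), -180.0, 180.0);
    }

    // Big-endian bytes, so unsigned byte comparison matches unsigned key order.
    static byte[] toBytes(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }
}
```

The result is always exactly 8 bytes, so it fits a fixed-width key slot. Note that it uses the whole 64-bit range, which leaves no spare value to stand for null.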

On 04/01/2013 11:29 PM, Jesse Yates wrote:
> Actually, that isn't all that far-fetched of a format Matt - pretty common
> anytime anyone wants to do sortable lat/long (*cough* three letter agencies
> *cough*).
>
> Wouldn't we get the same by providing a simple set of libraries (ala
> orderly + other HBase useful things) and then still give access to the
> underlying byte array? Perhaps a nullable key type in that lib makes sense
> if lots of people need it and it would be nice to have standard libraries
> so tools could interop much more easily.
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
>
> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>
>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
>> interfaces should be to provide first-class support for custom user types
>> in addition to the standard ones included.  Part of the power of hbase's
>> plain byte[] keys is that users can concoct the perfect key for their data
>> type.  For example, I have a lot of geographic data where I interleave
>> latitude/longitude bits into a sortable 64 bit value that would probably
>> never be included in a standard library.
>>
>>
>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis.soz@gmail.com> wrote:
>>
>>> I think having Int32, and NullableInt32 would support minimum overhead, as
>>> well as allowing SQL semantics.
>>>
>>>
>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>>
>>>> Furthermore, is it more important to support null values than squeeze all
>>>> representations into minimum size (4-bytes for int32, &c.)?
>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <ndimiduk@gmail.com> wrote:
>>>>
>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
>>>>> wrote:
>>>>>
>>>>>>  From the SQL perspective, handling null is important.
>>>>>
>>>>>  From your perspective, it is critical to support NULLs, even at the
>>>>> expense of fixed-width encodings or of representing the full range of
>>>>> values. That is, you'd rather be able to represent NULL than -2^31?
>>>>>
>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>
>>>>>>> I'm thinking I will press forward with a base implementation that does
>>>>>>> not support nulls. The idea is to provide an extensible set of
>>>>>>> interfaces, so I think this will not box us into a corner later. That
>>>>>>> is, a mirroring package could be implemented that supports null values
>>>>>>> and accepts the relevant trade-offs.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>>
>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I spent some time this weekend extracting bits of our serialization
>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools.
>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying
>>>>>>>> around.
>>>>>>>>
>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>
>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
>>>>>>>> behavior, and costs a little more in performance.  If the user can't
>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user to
>>>>>>>> provide their own interpretation of null and check for it themselves.
>>>>>>>>
>>>>>>>> If you reject nullability, the question becomes what to do in
>>>>>>>> situations where you're implementing existing interfaces that accept
>>>>>>>> nullable params.  The LongArrayList above implements List<Long>, which
>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
>>>>>>>> swap nulls with Long.MIN_VALUE; however, I'm now thinking it best to
>>>>>>>> force the user to make that swap and then throw
>>>>>>>> IllegalArgumentException if they pass null.
>>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil
>>>>>>>> <doug.meil@explorysmedical.com> wrote:
>>>>>>>>
>>>>>>>>> Hmmm... good question.
>>>>>>>>>
>>>>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>>>>> constructs, so I'd rather see something like losing MIN_VALUE and
>>>>>>>>> keeping fixed width.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <ndimiduk@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Heya,
>>>>>>>>>>
>>>>>>>>>> Thinking about data types and serialization. I think null support is
>>>>>>>>>> an important characteristic for the serialized representations,
>>>>>>>>>> especially when considering the compound type. However, doing so is
>>>>>>>>>> directly incompatible with fixed-width representations for numerics.
>>>>>>>>>> For instance, if we want to have a fixed-width signed long stored on
>>>>>>>>>> 8-bytes, where do you put null? float and double types can cheat a
>>>>>>>>>> little by folding negative and positive NaN's into a single
>>>>>>>>>> representation (this isn't strictly correct!), leaving a place to
>>>>>>>>>> represent null. In the long example case, the obvious choice is to
>>>>>>>>>> reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an
>>>>>>>>>> additional encoding which can be used for null. My experience working
>>>>>>>>>> with scientific data, however, makes me wince at the idea.
>>>>>>>>>>
>>>>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>>
>>>>>>>>>> Remember, the final goal is to support order-preserving serialization.
>>>>>>>>>> This imposes some limitations on our encoding strategies. For
>>>>>>>>>> instance, it's not enough to simply encode null, it really needs to be
>>>>>>>>>> encoded as 0x00 so as to sort lexicographically earlier than any other
>>>>>>>>>> value.
>>>>>>>>>>
>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
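
To make the trade-off discussed in the quoted thread concrete, here is a minimal sketch of one way to do it, assuming big-endian bytes compared as unsigned values (the way HBase orders row keys). OrderedLong and the marker values are hypothetical and are not taken from any proposal in the thread. The plain variant keeps the full 8-byte range but cannot express null; the nullable variant spends one extra header byte so that null, encoded as a single 0x00, sorts before every non-null value.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the Int64 / NullableInt64 split; not an HBase API.
final class OrderedLong {
    static final byte NULL_MARKER = 0x00;  // assumed: null sorts first
    static final byte VALUE_MARKER = 0x01; // assumed: prefix for non-null values

    // Fixed-width, non-nullable. Flipping the sign bit makes unsigned byte
    // order agree with signed numeric order.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    static long decode(byte[] b) {
        return ByteBuffer.wrap(b).getLong() ^ Long.MIN_VALUE;
    }

    // Nullable: one header byte, then the same 8-byte body for non-null values.
    static byte[] encodeNullable(Long v) {
        if (v == null) {
            return new byte[] { NULL_MARKER };
        }
        return ByteBuffer.allocate(9).put(VALUE_MARKER).putLong(v ^ Long.MIN_VALUE).array();
    }

    static Long decodeNullable(byte[] b) {
        if (b.length == 1 && b[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(b, 1, 8).getLong() ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

In this sketch neither MIN_VALUE nor MAX_VALUE is sacrificed; the cost of null support is the ninth byte and the loss of a single fixed width, since a null is one byte long.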

