hive-dev mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: HBase Types: Explicit Null Support
Date Tue, 02 Apr 2013 16:40:41 GMT
I agree that a user-extensible interface is a required feature here.
Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
keep in mind, though, that SQL and user applications are not the only
consumers of this interface. A big motivation is allowing interop with the
other higher-level MR languages. *cough* Where are my Pig and Hive peeps in
this thread?

On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com> wrote:

> Maybe if we can keep nullability separate from the
> serialization/deserialization, we can come up with a solution that works?
> We're able to essentially infer that a column is null based on its value
> being missing or empty. So if an iterator through the row key bytes could
> detect/indicate that, then an application could "infer" the value is null.
>
> We're definitely planning on keeping byte[] accessors for use cases that
> need them. I'm curious about the geographic data case, though: could you
> use a fixed length long with a couple of new SQL built-ins to encode/decode
> the latitude/longitude?
>
>
> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>
>> Actually, that isn't all that far-fetched of a format Matt - pretty common
>> anytime anyone wants to do sortable lat/long (*cough* three letter
>> agencies *cough*).
>>
>> Wouldn't we get the same by providing a simple set of libraries (ala
>> orderly + other HBase useful things) and then still give access to the
>> underlying byte array? Perhaps a nullable key type in that lib makes sense
>> if lots of people need it and it would be nice to have standard libraries
>> so tools could interop much more easily.
>> -------------------
>> Jesse Yates
>> @jesse_yates
>> jyates.github.com
>>
>>
>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>>
>>> Ah, I didn't even realize SQL allowed null key parts.  Maybe a goal of the
>>> interfaces should be to provide first-class support for custom user types
>>> in addition to the standard ones included.  Part of the power of HBase's
>>> plain byte[] keys is that users can concoct the perfect key for their data
>>> type.  For example, I have a lot of geographic data where I interleave
>>> latitude/longitude bits into a sortable 64-bit value that would probably
>>> never be included in a standard library.
>>>
>>>
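A minimal sketch of the kind of interleaved key Matt describes. The thread does not give his actual layout, so the class name `GeoInterleave`, the 32-bit scaling of each coordinate, and the choice to put latitude in the odd bits are all assumptions made for illustration.

```java
// Hypothetical helper, not taken from the hotpads code: packs a
// latitude/longitude pair into one 64-bit Z-order (Morton) value.
final class GeoInterleave {

    // Maps a coordinate in [min, max] onto the unsigned 32-bit range.
    static long scale(double value, double min, double max) {
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    // Spreads the low 32 bits of x so that bit i moves to bit 2*i.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    // Latitude bits land in odd positions, longitude bits in even positions.
    static long encode(double lat, double lon) {
        long latBits = scale(lat, -90.0, 90.0);
        long lonBits = scale(lon, -180.0, 180.0);
        return (spread(latBits) << 1) | spread(lonBits);
    }
}
```

The result uses all 64 bits, so it has to be compared as an unsigned value (for example with `Long.compareUnsigned`) or written out big-endian and compared byte by byte, which is how a row key would be ordered.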
>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis.soz@gmail.com>
>>> wrote:
>>>
>>>> I think having Int32 and NullableInt32 would support minimum overhead, as
>>>> well as allowing SQL semantics.
>>>>
>>>>
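One possible reading of the Int32 / NullableInt32 split Enis suggests, sketched under assumptions: the class name `Int32Codec`, the one-byte header, and the sign-bit flip used to keep byte order aligned with numeric order are illustrative choices, not something specified in the thread.

```java
// Hypothetical codec pair: a 4-byte non-nullable form and a nullable
// form that pays one extra header byte.
final class Int32Codec {

    // Non-nullable: 4 bytes, big-endian, sign bit flipped so that
    // unsigned byte-wise comparison matches signed integer order.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8), (byte) flipped
        };
    }

    // Nullable: header byte 0x00 alone means null, so null sorts before
    // every non-null value; header 0x01 is followed by the 4-byte payload.
    static byte[] encodeNullable(Integer value) {
        if (value == null) {
            return new byte[] { 0x00 };
        }
        byte[] out = new byte[5];
        out[0] = 0x01;
        System.arraycopy(encode(value), 0, out, 1, 4);
        return out;
    }

    static Integer decodeNullable(byte[] bytes) {
        if (bytes[0] == 0x00) {
            return null;
        }
        int flipped = ((bytes[1] & 0xFF) << 24) | ((bytes[2] & 0xFF) << 16)
                    | ((bytes[3] & 0xFF) << 8) | (bytes[4] & 0xFF);
        return flipped ^ 0x80000000;
    }
}
```

With this split, callers that never store null keep the full int range in 4 bytes, and only the nullable type carries the extra byte.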
>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimiduk@gmail.com>
>>>> wrote:
>>>>
>>>>> Furthermore, is it more important to support null values than to squeeze
>>>>> all representations into minimum size (4 bytes for int32, &c.)?
>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <ndimiduk@gmail.com> wrote:
>>>>>
>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
>>>>>> wrote:
>>>>>>
>>>>>>> From the SQL perspective, handling null is important.
>>>>>>
>>>>>> From your perspective, it is critical to support NULLs, even at the
>>>>>> expense of having fixed-width encodings at all, or of representing the
>>>>>> full range of values. That is, you'd rather be able to represent NULL
>>>>>> than -2^31?
>>>>>>
>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>
>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>
>>>>>>> I'm thinking I will press forward with a base implementation that does
>>>>>>> not support nulls. The idea is to provide an extensible set of
>>>>>>> interfaces, so I think this will not box us into a corner later. That
>>>>>>> is, a mirroring package could be implemented that supports null values
>>>>>>> and accepts the relevant trade-offs.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I spent some time this weekend extracting bits of our serialization
>>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
>>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying
>>>>>>>>> around.
>>>>>>>>>
>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>
>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
>>>>>>>>> behavior, and costs a little more in performance.  If the user can't
>>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
>>>>>>>>> to provide their own interpretation of null and check for it
>>>>>>>>> themselves.
>>>>>>>>>
>>>>>>>>> If you reject nullability, the question becomes what to do in
>>>>>>>>> situations where you're implementing existing interfaces that accept
>>>>>>>>> nullable params.  The LongArrayList above implements List<Long> which
>>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
>>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
>>>>>>>>> force the user to make that swap and then throw
>>>>>>>>> IllegalArgumentException if they pass null.
>>>
>>>>
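A short sketch of the stricter behavior Matt says he now prefers: a primitive-backed `List<Long>` whose `add(Long)` throws instead of silently swapping null for `Long.MIN_VALUE`. This is not the hotpads `LongArrayList`; the class name `StrictLongList` and its internals are hypothetical.

```java
// Hypothetical list backed by a long[] that refuses null elements.
final class StrictLongList extends java.util.AbstractList<Long> {
    private long[] values = new long[8];
    private int size = 0;

    @Override
    public boolean add(Long value) {
        if (value == null) {
            // The caller must pick a sentinel explicitly; none is chosen here.
            throw new IllegalArgumentException("null elements are not supported");
        }
        if (size == values.length) {
            values = java.util.Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
        modCount++; // lets iterators detect concurrent modification
        return true;
    }

    @Override
    public Long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return values[index];
    }

    @Override
    public int size() {
        return size;
    }
}
```

A caller who does want a null placeholder can map it to a value of their choosing before calling `add`, which keeps that decision visible at the call site.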
>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil
>>>>>>>>> <doug.meil@explorysmedical.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>
>>>>>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
>>>>>>>>>> and keeping fixed width.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <ndimiduk@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Heya,
>>>>>>>>>>>
>>>>>>>>>>> Thinking about data types and serialization. I think null support is
>>>>>>>>>>> an important characteristic for the serialized representations,
>>>>>>>>>>> especially when considering the compound type. However, doing so is
>>>>>>>>>>> directly incompatible with fixed-width representations for numerics.
>>>>>>>>>>> For instance, if we want to have a fixed-width signed long stored on
>>>>>>>>>>> 8 bytes, where do you put null? float and double types can cheat a
>>>>>>>>>>> little by folding negative and positive NaN's into a single
>>>>>>>>>>> representation (this isn't strictly correct!), leaving a place to
>>>>>>>>>>> represent null. In the long example case, the obvious choice is to
>>>>>>>>>>> reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an
>>>>>>>>>>> additional encoding which can be used for null. My experience
>>>>>>>>>>> working with scientific data, however, makes me wince at the idea.
>>>>>>>>>>>
>>>>>>>>>>> The variable-width encodings have it a little easier. There's
>>>>>>>>>>> already enough going on that it's simpler to make room.
>>>>>>>>>>>
>>>>>>>>>>> Remember, the final goal is to support order-preserving
>>>>>>>>>>> serialization. This imposes some limitations on our encoding
>>>>>>>>>>> strategies. For instance, it's not enough to simply encode null, it
>>>>>>>>>>> really needs to be encoded as 0x00 so as to sort lexicographically
>>>>>>>>>>> earlier than any other value.
>>>>>>>>>>
>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>
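The fixed-width trade-off discussed at the start of the thread (give up one value of the long range so null has an encoding that sorts first) can be sketched as follows. This is an illustration under assumptions, not a design anyone in the thread committed to: the class name `NullableFixedLong` is hypothetical, and it reserves `Long.MIN_VALUE`, the option Doug mentions.

```java
// Hypothetical fixed-width, order-preserving codec for a nullable long.
final class NullableFixedLong {

    // 8 bytes, big-endian, sign bit flipped so unsigned byte order matches
    // signed numeric order. Long.MIN_VALUE would map to eight 0x00 bytes,
    // so that slot is given to null instead.
    static byte[] encode(Long value) {
        long bits;
        if (value == null) {
            bits = 0L; // all-zero encoding sorts before every other value
        } else if (value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        } else {
            bits = value ^ Long.MIN_VALUE;
        }
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        return out;
    }

    static Long decode(byte[] bytes) {
        long bits = 0L;
        for (int i = 0; i < 8; i++) {
            bits = (bits << 8) | (bytes[i] & 0xFF);
        }
        return bits == 0L ? null : Long.valueOf(bits ^ Long.MIN_VALUE);
    }

    // Unsigned lexicographic comparison, the order in which row keys sort.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

Every value stays 8 bytes wide and null sorts first, at the cost of rejecting one legal long, which is exactly the compromise Nick says makes him wince for scientific data.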
