hbase-dev mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: HBase Types: Explicit Null Support
Date Fri, 05 Apr 2013 02:18:59 GMT
With Phoenix, variable width types may be null in all cases (in the row 
key or as key values) and fixed width types may be null as key values or 
as the last row key column. We only allow a binary type in the row key 
as the last column. We haven't had any push back on these restrictions 
to date.

Would it make sense to clean up the APIs a bit and post just the type 
system code somewhere to give us something to poke holes at?

Thanks,

     James

On 04/04/2013 06:49 PM, Nick Dimiduk wrote:
> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com> wrote:
>
>> Maybe if we can keep nullability separate from the
>> serialization/deserialization, we can come up with a solution that works?
>
> I think implied null could work, but let's build out the matrix. I see two
> kinds of types: fixed- and variable-width. These types are used in two
> scenarios: on their own or as part of a compound type.
>
> A fixed-width type used standalone can infer null from absence of a value.
> When used in a compound type, absence isn't enough to indicate null unless
> it's the last value in the sequence. To support a null field in the middle
> of the compound type, the type is forced to explicitly mark the field as null.
> The only solution I can think of (without sacrificing the full value range,
> per my original question) is to write the full type width bytes, followed
> by an isNull byte. Thus, for example, the INT type consumes 4 bytes when
> serialized stand-alone, but 5 bytes when composed.
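
A minimal Java sketch of the 4-versus-5-byte layout described above, assuming a big-endian int followed by a one-byte isNull flag. The class and method names are hypothetical and come from neither HBase nor Phoenix, and the sketch ignores sort order, which a real row key encoding would also have to preserve.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not an HBase or Phoenix API.
public class NullableIntField {
    static final int STANDALONE_WIDTH = 4; // value bytes only
    static final int COMPOSED_WIDTH = 5;   // value bytes plus isNull byte

    /** Standalone: null is implied by absence, so it encodes to zero bytes. */
    static byte[] encodeStandalone(Integer v) {
        if (v == null) {
            return new byte[0];
        }
        return ByteBuffer.allocate(STANDALONE_WIDTH).putInt(v).array();
    }

    /** Composed: always 5 bytes, the full-width value followed by an isNull flag. */
    static byte[] encodeComposed(Integer v) {
        ByteBuffer buf = ByteBuffer.allocate(COMPOSED_WIDTH);
        buf.putInt(v == null ? 0 : v);       // placeholder value when null
        buf.put((byte) (v == null ? 1 : 0)); // 1 marks the field as null
        return buf.array();
    }

    /** Reads one composed field starting at the given offset. */
    static Integer decodeComposed(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, COMPOSED_WIDTH);
        int value = buf.getInt();
        boolean isNull = buf.get() != 0;
        return isNull ? null : value;
    }
}
```

Because the flag sits outside the four value bytes, Integer.MIN_VALUE stays distinguishable from null, at the cost of one extra byte per field.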
>
> James, how does Phoenix handle a null fixed-width rowkey component? I don't
> see that implemented in PDataType enum.
>
> Variable-width types used standalone are simple enough because HBase handles
> arbitrary-length byte[]'s everywhere. Variable-width in a composite is a
> problem. Phoenix forces these values to appear only in the last position of
> the composite, as I understand it. Orderly provides explicit null and
> termination bytes by taking advantage of a feature of UTF-8 encoding.
> Support for bytes is equally ugly (but clever) in that byte digits are
> encoded in BCD. Both of these approaches slightly bloat the serialized
> representation over the natural representation, but they allow the
> variable-length types to be used anywhere within the compound type. As an
> added bonus regarding code maintainability, their serialization is entirely
> self-contained within the type. That's in contrast to the fixed-width type
> implementation described above, where null is explicitly encoded by the
> compound type.
>
> My opinion is that the computational and storage overhead imposed by Orderly's
> implementation is worth the trade-off in flexibility in user consumption.
> Correct me if I'm wrong, James, but you're saying, from your experience with
> Phoenix, users are willing to work within that constraint?
>
> Thanks,
> Nick
>
> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>
>>> Actually, that isn't all that far-fetched of a format Matt - pretty common
>>> anytime anyone wants to do sortable lat/long (*cough* three letter
>>> agencies *cough*).
>>>
>>> Wouldn't we get the same by providing a simple set of libraries (ala
>>> orderly + other HBase useful things) and then still give access to the
>>> underlying byte array? Perhaps a nullable key type in that lib makes sense
>>> if lots of people need it and it would be nice to have standard libraries
>>> so tools could interop much more easily.
>>> -------------------
>>> Jesse Yates
>>> @jesse_yates
>>> jyates.github.com
>>>
>>>
>>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>>>
>>>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of
>>>> the interfaces should be to provide first-class support for custom user
>>>> types in addition to the standard ones included.  Part of the power of
>>>> hbase's plain byte[] keys is that users can concoct the perfect key for
>>>> their data type.  For example, I have a lot of geographic data where I
>>>> interleave latitude/longitude bits into a sortable 64 bit value that would
>>>> probably never be included in a standard library.
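
One common way to build such a key is a Z-order (Morton) interleave. The Java sketch below assumes that general technique and is not Matt's actual code: the class name, the scaling of each coordinate to 32 bits, and the choice to put latitude in the odd bit positions are all illustrative.

```java
// Hypothetical illustration of a Z-order key; not taken from hotpads/data-tools.
public class GeoKey {
    /** Maps a coordinate in [min, max] to an unsigned 32-bit integer held in a long. */
    static long scale(double value, double min, double max) {
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    /** Spreads the low 32 bits of x so that bit i moves to bit 2i. */
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    /** Interleaves latitude (odd bits) and longitude (even bits) into 64 bits. */
    static long interleave(double lat, double lon) {
        long latBits = scale(lat, -90.0, 90.0);
        long lonBits = scale(lon, -180.0, 180.0);
        return (spread(latBits) << 1) | spread(lonBits);
    }
}
```

The result uses all 64 bits, so it should be written big-endian and compared as unsigned bytes (or with Long.compareUnsigned) rather than as a signed long. No bit pattern is left over for null, which is the tension the rest of the thread is about.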
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis.soz@gmail.com>
>>>> wrote:
>>>>
>>>>> I think having Int32, and NullableInt32 would support minimum overhead,
>>>>> as well as allowing SQL semantics.
>>>>>
>>>>>
>>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimiduk@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Furthermore, is it more important to support null values than squeeze
>>>>> all representations into minimum size (4-bytes for int32, &c.)?
>>>>>
>>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <ndimiduk@gmail.com> wrote:
>>>>>>
>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
>>>>>> wrote:
>>>>>>
>>>>>>> From the SQL perspective, handling null is important.
>>>>>>
>>>>>> From your perspective, it is critical to support NULLs, even at the
>>>>>> expense of fixed-width encodings at all or supporting representation
>>>>>> of a full range of values. That is, you'd rather be able to represent
>>>>>> NULL than -2^31?
>>>>>>>
>>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>>
>>>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>>>
>>>>>>>>> I'm thinking I will press forward with a base implementation that
>>>>>>>>> does not support nulls. The idea is to provide an extensible set of
>>>>>>>>> interfaces, so I think this will not box us into a corner later.
>>>>>>>>> That is, a mirroring package could be implemented that supports null
>>>>>>>>> values and accepts the relevant trade-offs.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I spent some time this weekend extracting bits of our serialization
>>>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools.
>>>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying
>>>>>>>>>> around.
>>>>>>>>>>
>>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>>
>>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
>>>>>>>>>> behavior, and costs a little more in performance.  If the user can't
>>>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user to
>>>>>>>>>> provide their own interpretation of null and check for it themselves.
>>>>>>>>>>
>>>>>>>>>> If you reject nullability, the question becomes what to do in
>>>>>>>>>> situations where you're implementing existing interfaces that accept
>>>>>>>>>> nullable params.  The LongArrayList above implements List<Long>, which
>>>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
>>>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
>>>>>>>>>> force the user to make that swap and then throw
>>>>>>>>>> IllegalArgumentException if they pass null.
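
The stricter behavior described above could look roughly like the following Java sketch. StrictLongList is a hypothetical class written for illustration; it is not the LongArrayList from the hotpads repository and omits most of what a production list would need.

```java
import java.util.AbstractList;
import java.util.Arrays;

// Hypothetical primitive-backed list that rejects null instead of swapping in a sentinel.
public class StrictLongList extends AbstractList<Long> {
    private long[] values = new long[8];
    private int size = 0;

    @Override
    public boolean add(Long v) {
        if (v == null) {
            // The caller must pick an explicit stand-in for null (or use a nullable wrapper).
            throw new IllegalArgumentException("null is not supported by this list");
        }
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // grow the backing array
        }
        values[size++] = v;
        modCount++;
        return true;
    }

    @Override
    public Long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return values[index];
    }

    @Override
    public int size() {
        return size;
    }
}
```

With this shape, Long.MIN_VALUE remains an ordinary storable value, and any null-to-sentinel mapping is a visible decision in the caller's code rather than a hidden one in the list.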
>>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil
>>>>>>>>>> <doug.meil@explorysmedical.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>>
>>>>>>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
>>>>>>>>>>> and keeping fixed width.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <ndimiduk@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Heya,
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking about data types and serialization. I think null support is
>>>>>>>>>>>> an important characteristic for the serialized representations,
>>>>>>>>>>>> especially when considering the compound type. However, doing so is
>>>>>>>>>>>> directly incompatible with fixed-width representations for numerics.
>>>>>>>>>>>> For instance, if we want to have a fixed-width signed long stored on
>>>>>>>>>>>> 8-bytes, where do you put null? float and double types can cheat a
>>>>>>>>>>>> little by folding negative and positive NaN's into a single
>>>>>>>>>>>> representation (this isn't strictly correct!), leaving a place to
>>>>>>>>>>>> represent null. In the long example case, the obvious choice is to
>>>>>>>>>>>> reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an
>>>>>>>>>>>> additional encoding which can be used for null. My experience working
>>>>>>>>>>>> with scientific data, however, makes me wince at the idea.
>>>>>>>>>>>>
>>>>>>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>>>>
>>>>>>>>>>>> Remember, the final goal is to support order-preserving serialization.
>>>>>>>>>>>> This imposes some limitations on our encoding strategies. For
>>>>>>>>>>>> instance, it's not enough to simply encode null, it really needs to be
>>>>>>>>>>>> encoded as 0x00 so as to sort lexicographically earlier than any other
>>>>>>>>>>>> value.
>>>>>>>>>>>
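
To make the ordering requirement concrete, here is a minimal Java sketch, assuming a one-byte header in front of the value: 0x00 for null and 0x01 for a present value, followed by the long with its sign bit flipped so that unsigned byte comparison matches signed numeric order. The class, the header values, and the 9-byte layout are illustrative assumptions, not an encoding proposed in this thread; note that it pays for null with an extra byte instead of giving up a value from the long range.

```java
import java.nio.ByteBuffer;

// Hypothetical order-preserving encoding for a nullable long.
public class OrderedNullableLong {
    static final byte NULL_MARKER = 0x00;    // sorts before any present value
    static final byte PRESENT_MARKER = 0x01;

    static byte[] encode(Long v) {
        if (v == null) {
            return new byte[] { NULL_MARKER };
        }
        return ByteBuffer.allocate(9)
                .put(PRESENT_MARKER)
                .putLong(v ^ Long.MIN_VALUE) // flip the sign bit so negatives sort first
                .array();
    }

    static Long decode(byte[] b) {
        if (b[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(b, 1, 8).getLong() ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the way raw byte[] keys are ordered. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }
}
```

Under these assumptions null sorts ahead of Long.MIN_VALUE, and Long.MIN_VALUE ahead of every other value, without removing anything from the representable range.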
>>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nick
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>

