From: Andrew Purtell
To: "dev@hbase.apache.org"
Cc: Ryan Blue, "jamestaylor@apache.org"
Subject: Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
Date: Mon, 19 May 2014 06:31:49 -0700

So if I can summarize this thread so far, we are going to try and hammer
out a types encoding spec agreeable to HBase, Phoenix, and Kite alike, as
opposed to selecting a particular implementation today as both spec and
reference implementation. Is that correct?

If so, that sounds like a promising direction. The HBase types library has
the flexibility, if I understand Nick correctly, to accommodate whatever is
agreed upon, and we could then provide a reference implementation as a
service for HBase users (or anyone). There would be no strings attached:
multiple implementations of the spec would interoperate by definition.

> On May 19, 2014, at 3:20 AM, Nick Dimiduk wrote:
>
> On Thu, May 15, 2014 at 9:32 AM, James Taylor wrote:
>
>> @Nick - I like the abstraction of the DataType, but that doesn't solve the
>> problem for non Java usage.
>
> That's true. It's very much a Java construct. Likewise, Struct only codes
> for semantics; there's no encoding defined there.
> For correct multi-language support, we'll need to define these semantics
> the same way we do the encoding details so that implementations can
> reproduce them faithfully.
>
>> I'm also a bit worried that it might become a bottleneck for implementors
>> of the serialization spec as there are many different platform specific
>> operations that will likely be done on the row key. We can try to get
>> everything necessary in the DataType interface, but I suspect that
>> implementors will need to go under-the-covers at times (rather than waiting
>> for another release of the module that defines the DataType interface) -
>> might become a bottleneck.
>
> Time will tell. DataType is just an interface, after all. If there are
> things it's missing (as there surely are, for Phoenix...), it'll need to be
> extended locally until these features can be pushed down into HBase. HBase
> release managers have been faithful to the monthly release train, so I
> think in practice dependent projects won't have to wait long. I'm content
> to take this on a case-by-case basis and watch for a trend. Do you have an
> alternative idea?
>
>> On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk wrote:
>>
>>> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue wrote:
>>>
>>>> I think there's a little confusion in what we are trying to accomplish.
>>>> What I want to do is to write a minimal specification for how to store a
>>>> set of types. I'm not trying to leave much flexibility, what I want is
>>>> clarity and simplicity.
>>>
>>> This is admirable and was my initial goal as well. The trouble is, you
>>> cannot please everyone, current users and new. So, we decided it was better
>>> to provide a pluggable framework for extension + some basic implementations
>>> than to implement a closed system.
>>>
>>>> This is similar to OrderedBytes work, but a subset of it.
>>>> A good example is that while it's possible to use different encodings
>>>> (avro, protobuf, thrift, ...) it isn't practical for an application to
>>>> support all of those encodings. So for interoperability between Kite,
>>>> Phoenix, and others, I want a set of requirements that is as small as
>>>> possible.
>>>
>>> Minimal is good. The surface area of o.a.h.h.types is as large as it is
>>> because there was always "just one more" type to support or encoding to
>>> provide.
>>>
>>>> To make the requirements small, I used off-the-shelf protobuf [1] plus a
>>>> small set of memcmp encodings: ints, floats, and binary. That way, we don't
>>>> have to talk about how to make a memcmp Date in bytes, for example. A Date
>>>> is an int, which we know how to encode, and we can agree separately on how
>>>> a Date is represented (e.g., Julian vs unix epoch). [2] The same applies
>>>> to binary, where the encoding handles sorting and nulls, but not charsets.
>>>
>>> I think you should focus on the primitives you want to support. The
>>> compound type stuff (ie, "rowkey encodings") is a can of worms because you
>>> need to support existing users, new users, novice users, and advanced
>>> users. Hence the interop between the DataType interface and the Struct
>>> classes. These work together to support all of these use-cases with the
>>> same basic code. For example, the protobuf encoding of position|wire-type +
>>> encoded value is easily implemented using Struct.
>>>
>>> I firmly believe that we cannot dictate rowkey composition. Applications,
>>> however, are free to implement their own. By using the common DataType
>>> interface, they can all interoperate.
>>>
>>>> This is the largest reason why I didn't include OrderedBytes directly in
>>>> the spec. For example, OB includes a varint that I don't think is needed.
>>>> I don't object to its inclusion in OB, but I think it isn't a necessary
>>>> requirement for implementing this spec.
>>>
>>> Again, the surface area is as it is because of community consensus during
>>> the first phase of implementation. That consensus disagrees with you.
>>>
>>>> I think there are 3 things to clear up:
>>>> 1. What types from OB are not included, and why?
>>>> 2. Why not use OB-style structs?
>>>> 3. Why choose protobuf for complex records?
>>>>
>>>> Does that sound like a reasonable direction to head with this discussion?
>>>
>>> Yes, sounds great!
>>>
>>>> As far as the DataType API, I think that works great with what I'm trying
>>>> to do. We'd build a DataType implementation for the encoding and the API
>>>> will let applications handle the underlying encoding. And other encoding
>>>> strategies can be swapped in as well, if we want to address shortcomings in
>>>> this one, or have another for a different use case.
>>>
>>> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji are
>>> the target audience of the DataType API.
>>>
>>> Thank you for picking back up this baton. It's sat for too long.
>>>
>>> -n
>>>
>>>> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
>>>>
>>>>> Breaking off hackathon thread.
>>>>>
>>>>> The conversation around HBASE-8089 concluded with two points:
>>>>> - HBase should provide support for order-preserving encodings while
>>>>>   not dropping support for the existing encoding formats.
>>>>> - HBase is not in the business of schema management; that is a
>>>>>   responsibility left to application developers.
>>>>>
>>>>> To handle the first point, OrderedBytes is provided. To support the
>>>>> second, the DataType API is introduced. By introducing this layer
>>>>> above specific encoding formats, it gives us a hook for plugging in
>>>>> different implementations and for helper utilities to ship with HBase,
>>>>> such as HBASE-10091.
>>>>>
>>>>> Things get fuzzy around complex data types: pojos, compound rowkeys (a
>>>>> special case of pojo), maps/dicts, and lists/arrays. These types are
>>>>> composed of other types and have different requirements based on where
>>>>> in the schema they're used. Again, by falling back on the DataType API,
>>>>> we give application developers an "out" for doing what makes the most
>>>>> sense for them.
>>>>>
>>>>> For compound rowkeys, the Struct class is designed to fill in this gap,
>>>>> sitting between data encoding and schema expression. It gives the
>>>>> application implementer, the person managing the schema, enough
>>>>> flexibility to express the key encoding in terms of the component types.
>>>>> These components are not limited to the simple primitives already
>>>>> defined, but can be any DataType implementation. Order preservation is
>>>>> likely important here.
>>>>>
>>>>> For arrays/lists, there's no implementation yet, but you can see how it
>>>>> might be done if you have a look at Struct. Order preservation may or
>>>>> may not be important for arrays/lists.
>>>>>
>>>>> The situation for maps/dicts is similar to arrays/lists. The one
>>>>> complication is the case where you want to map to a column family. How
>>>>> can these APIs support this thing?
>>>>>
>>>>> Pojos are a little more complicated. Probably Struct is sufficient for
>>>>> basic cases, but it doesn't support nice features like versioning --
>>>>> these are sacrificed in favor of order preservation. Luckily, there are
>>>>> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
>>>>> Thrift, &c. There's no need to reinvent the wheel here. Application
>>>>> developers can implement the DataType API backed by their management
>>>>> tool of choice. I created HBASE-11161 and will post a patch shortly.
>>>>>
>>>>> Specific comments about the Hackathon notes inline.
>>>>>
>>>>> Thanks,
>>>>> Nick
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
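
For readers following the "memcmp encodings" point in the quoted thread, here is a minimal sketch of one common way to make a signed 32-bit int sort correctly under byte-wise comparison: flip the sign bit and write the value big-endian. This is an illustration only. It is not the OrderedBytes wire format and not the encoding from Ryan's draft spec, and the class and method names (MemcmpIntSketch, encode, decode, memcmp) are made up for this example.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; names are not from HBase, Phoenix, or Kite.
final class MemcmpIntSketch {

    // Flip the sign bit so negative values sort before positive ones,
    // then write big-endian (ByteBuffer's default byte order).
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
    }

    // Reverse of encode: read big-endian, flip the sign bit back.
    static int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ Integer.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the way a byte-ordered store
    // would compare row keys.
    static int memcmp(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {Integer.MIN_VALUE, -5, 0, 7, Integer.MAX_VALUE};
        boolean ordered = true;
        // Adjacent values are in ascending numeric order, so every encoded
        // pair should also compare as strictly ascending.
        for (int i = 0; i + 1 < values.length; i++) {
            ordered &= memcmp(encode(values[i]), encode(values[i + 1])) < 0;
        }
        System.out.println("byte order matches numeric order: " + ordered);
    }
}
```

Under this kind of scheme a Date needs no encoding of its own: once the parties agree on what the int means (for example, days since some epoch), it is stored with the int encoding, which is the separation of concerns described in the quoted message.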