Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Received-SPF: pass (athena.apache.org: domain of brock@cloudera.com designates
 209.85.216.175 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CABgNGzc=0GH1FWG=Q2CRS9eM0Pz18S4u9fJL4q8bPGtR8knRqA@mail.gmail.com>
References: 
 <CAFukC=6XsS1KjgAD7HV2V4wWoiGjzctm1RuJCCzSOCDJ8X31eg@mail.gmail.com>
 <CABgNGzc=0GH1FWG=Q2CRS9eM0Pz18S4u9fJL4q8bPGtR8knRqA@mail.gmail.com>
From: Brock Noland <brock@cloudera.com>
Date: Thu, 6 Mar 2014 12:37:59 -0600
Message-ID: 
 <CAFukC=4sxfVjXeBG5xwzhCM_BBzrFHU+zrXrgsQJE=+LPnq8gw@mail.gmail.com>
Subject: Re: Thoughts on new metastore APIs
To: "dev@hive.apache.org" <dev@hive.apache.org>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Mar 6, 2014 at 12:13 PM, Thejas Nair <thejas@hortonworks.com> wrote:
> Thanks for discussing this, Brock. I agree that is important to
> consider while writing a new metastore api call.
> But I think this (single input/output struct)  should be a guideline,
> I am not sure if this should be used in every case.

As with any rule, there are always exceptions. However, looking at the
new API's I don't see an instance where it would would have been
harmful to use this model. Exceptions should be extremely rare since
thousands of RPC implementations successfully use the request/response
model. However, as always, it would be up to the developers working on
the change to make this call. They'd do so knowing they are going
against the guideline and could be asked to justify why they are doing
so.

> What you are saying shows that there is a tradeoff between ending up
> with more functions vs ending up with more input/output
> structs/classes. I am not sure if having more input/output structs is
> any better.
>
> Take the case of create_table/create_table_with_environment_context
> that you mentioned. Even though create_table had a single input
> argument Table, instead of adding  EnvironmentContext contents to
> Table struct, the authors decided to create a new function with
> additional  EnvironmentContext argument. This makes sense because the
> Table struct is used by other functions as well such as get_table, and
> EnvironmentContext fields don't make sense for those cases as it is
> not persisted as part of table.
>
> Which means that the only way to prevent creation of
> create_table_with_environment_context method would have been to have a
> CreateTableArg struct as input argument instead of Table as the
> argument. ie, creating a different struct for the single input/output
> every function is only way you can be sure that you don't need more
> functions.

RPC methods are "special". They are published to the world and
therefore cannot be easily modified or refactored. Once we create a
new RPC method, we are stuck with it for a very long time. In this
way, Thrift is rather strange in that it allows you to exposed the api
signatures. The request/response model is far more common.

Therefore, if we were creating create_table from scratch, I would
suggest we use:

CreateTableRequest/CreateTableResponse

That way, we can add optional arguments such as "environment context"
very easily. Although, I'd love to take credit for this idea, I didn't
just come up with this myself. This is a standard way of handling RPC.

> This approach of reducing the number of functions also means that you
> would start encoding different types of actions within the single
> input argument.
>
> Consider the case of get_partition vs get_partition_by_name. It would
> need a single struct with an enum that tells if it to lookup based on
> the partition key-values or the name, and based on the enum it would
> use different fields in the struct. I feel having different functions
> is more readable for this case.

There will be cases where we'll need similar methods. This point where
is with the request/response model, adding a single parameter doesn't
require an entirely new method. The developers working on the change
would have to make the call as to whether a new functionality requires
a new API or can be handled within the current API.

> For example, the api in HIVE-5931 would need to change from
> ====================================
> list<RolePrincipalGrant> get_principals_in_role(1:string role_name)
> ====================================
> to
>
> ====================================
> struct GetPrincipalInRoleOutput{
>  1:list<RolePrincipalGrant> rolePrincList;
> }
>
> struct GetPrincipalInRoleInput{
> 1:string role_name;
> }
> GetPrincipalInRoleOutput get_principals_in_role(1:
> GetPrincipalInRoleInput input);
> ====================================
>
> I am not sure if the insurance costs in terms of readability is low
> here. I think we should consider the risk in each case of function
> proliferation and pay the costs accordingly.
> Let me know if I have misunderstood what you are proposing here.

The "Output" and "Input" names do feel odd. As I said above.
Request/Response are standard names for these kinds of objects. Today,
for a method like the one above, it may feel like overheard or extra
work. However, in the future when you want to add another parameter
such as "isFilter" or "encoding" etc, then the insurance pays off big
time.

Brock