hive-dev mailing list archives

From Sergey Shelukhin <ser...@hortonworks.com>
Subject Re: [DISCUSS] Making storage-api a separately released artifact
Date Fri, 19 Aug 2016 23:14:19 GMT
Can we just run the versions through a single sequence? I.e. increment the
number every time but release only one component (or both, if they happen
to align, I guess). E.g. storage-api would be released as 2.2, then as 2.3
if it moves fast, then Hive 2.4, then storage-api 2.5, etc.
That might make it easier to reason about compatibility because the order
is obvious.
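
To make the ordering argument concrete, here is a minimal sketch of a shared
version sequence. It is purely illustrative: the release list and the helper
names (RELEASES, parse, latest_at_or_before) are hypothetical and only mirror
the example numbers above, not any actual Hive or storage-api release.

```python
# Hypothetical release log for a single shared version sequence.
# The numbers mirror the example in the message; none of them is a
# statement about real Hive or storage-api releases.
RELEASES = [
    ("2.2", "storage-api"),
    ("2.3", "storage-api"),
    ("2.4", "hive"),
    ("2.5", "storage-api"),
]


def parse(version):
    """Turn '2.4' into (2, 4) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_at_or_before(component, version):
    """Return the newest release of `component` numbered <= `version`, or None."""
    limit = parse(version)
    candidates = [v for v, c in RELEASES if c == component and parse(v) <= limit]
    return max(candidates, key=parse) if candidates else None


# Which storage-api release could Hive 2.4 have been built against?
print(latest_at_or_before("storage-api", "2.4"))
```

Because both components draw from one sequence, "which storage-api existed
when this Hive was cut" reduces to a plain numeric comparison.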

On 16/8/19, 09:04, "Sergio Pena" <sergio.pena@cloudera.com> wrote:

>I see Parquet is currently using the SearchArgument class for predicate
>pushdown.
>Will this class be part of the new sub-module or project?
>
>Following Sushanth's idea, can we have other API interfaces in the new
>project that other components can use?
>Perhaps having this may be a good reason to create a project.
>
>I'm -1 on the 4th-level version number. As Owen mentioned, changing the 4th
>version number for incompatible changes is ugly and confusing.
>I like the new project idea more, +1, but the storage-api may be too small
>for a new project.
>
>- Sergio
>
>On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <omalley@apache.org> wrote:
>
>> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <alanfgates@gmail.com> wrote:
>>
>> > +1 for making the API clean and easy for other projects to work with. A
>> > few questions:
>> >
>> > 1) Would this also make it easier for Parquet and others to implement
>> > Hive’s ACID interfaces?
>> >
>>
>> Currently the ACID interfaces haven't been moved over to storage-api,
>> although it would make sense to do so at some point.
>>
>>
>> >
>> > 2) Would we make any attempt to coordinate version numbers between Hive
>> > and the storage module, or would a given version of Hive just depend on a
>> > given version of the storage module?
>> >
>>
>> The two options that I see are:
>>
>> * Let the numbers run separately starting from 2.2.0.
>> * Tie the numbers together with an additional level of versioning (e.g.
>> 2.2.0.0).
>>
>> I think that letting the two version numbers diverge is better in the long
>> term. For example, if you need to make an incompatible change, it is pretty
>> ugly to do it as a fourth-level version number (e.g. an incompatible change
>> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api would
>> move faster than Hive, but as it stabilizes I expect it might start moving
>> slower than Hive.
>>
>> I'd propose that we have Hive's build use a released version of storage-api
>> rather than a snapshot.
>>
>> Thoughts?
>>
>>    Owen
>>
>>
>> > Alan.
>> >
>> > > On Aug 15, 2016, at 17:01, Owen O'Malley <omalley@apache.org> wrote:
>> > >
>> > > All,
>> > >
>> > > As part of moving ORC out of Hive, we pulled all of the vectorization
>> > > storage and sarg classes into a separate module, which is named
>> > > storage-api. Although it is currently only used by ORC, it could be used
>> > > by Parquet or Avro if they wanted to make a fast vectorized reader that
>> > > read directly into Hive's VectorizedRowBatch without needing a shim or
>> > > data copy. Note that this is in many ways similar to pulling the Arrow
>> > > project out of Drill.
>> > >
>> > > This unfortunately still leaves us with a circular dependency between Hive
>> > > and ORC. I'd hoped that storage-api wouldn't change that much, but that
>> > > doesn't seem to be happening. As a result, ORC ends up shipping its own
>> > > fork of storage-api.
>> > >
>> > > Although we could make a new project for just the storage-api, I think it
>> > > would be better to make it a subproject of Hive that is released
>> > > independently.
>> > >
>> > > What do others think?
>> > >
>> > >   Owen
>> >
>> >
>>
