Mailing-List: contact dev-help@arrow.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@arrow.apache.org
From: Julian Hyde <jhyde@apache.org>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\))
Subject: Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0
Date: Wed, 26 Jul 2017 12:13:57 -0700
References: <CAJPUwMCeKsqcXf=oun1OTLa_AUTKAxxznMgftOHwo-6TPpf+UA@mail.gmail.com>
 <CAKa9qDmd6T8DR_JFT1TcPuNxeq1n17HzNQ7==QX0QDxiQGZWug@mail.gmail.com>
 <CAJPUwMAC6LGWyRUy4nCeHhMNZopbrEw7a5WaR6rgbFT+O+V4-g@mail.gmail.com>
 <CAJPUwMCF32zx-oAuJptOMWgG94eo1wTKUw4P=vOCMPLv-Or2tQ@mail.gmail.com>
 <8D61112F-E65E-4877-809C-1B6BBDC5B330@apache.org>
 <CAJPUwMBT38TuuKYdxQytXeOw7UOr6Mt6cQvcMxFPjKyXm58sdg@mail.gmail.com>
 <CAJPUwMAo9N75-J_6cMPneKfOxat-82i-55Xa4GF5GfBBKONyTQ@mail.gmail.com>
To: dev@arrow.apache.org
In-Reply-To: <CAJPUwMAo9N75-J_6cMPneKfOxat-82i-55Xa4GF5GfBBKONyTQ@mail.gmail.com>
Message-Id: <B76997CC-6FC4-4842-8DAF-03B3954C0AFC@apache.org>
archived-at: Wed, 26 Jul 2017 19:14:17 -0000

I agree with all that. But semantic versioning only pertains to public
APIs. So, for it to work, you need to declare what are your public APIs.
If you don’t, people will make assumptions about what are your public
APIs, and they may get it wrong.

The ability to add experimental APIs (not subject to semantic versioning
until they are officially declared public) will help the project evolve
and stay relevant.
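
As a rough illustration of what an explicit maturity flag could look like
in a Python binding, here is a minimal sketch. Everything in it is
hypothetical: the `experimental` decorator, the `ExperimentalWarning`
class, the `is_public` helper and the two reader functions are made-up
names for this example, not existing Arrow APIs.

```python
import functools
import warnings


class ExperimentalWarning(FutureWarning):
    """Hypothetical warning category for APIs outside the semver contract."""


def experimental(func):
    """Mark a callable as experimental (hypothetical, not an Arrow API)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Warn on every call so users know this API may change in any release.
        warnings.warn(
            f"{func.__name__} is experimental and may change in any release",
            ExperimentalWarning,
            stacklevel=2,
        )
        return func(*args, **kwargs)

    # Flag that tooling or docs generators could inspect.
    wrapper.__experimental__ = True
    return wrapper


def is_public(func):
    """Return True when a callable is bound by the semantic versioning contract."""
    return not getattr(func, "__experimental__", False)


def read_batch(data):
    """Declared public: breaking changes would need a major version bump."""
    return list(data)


@experimental
def read_batch_compressed(data):
    """Experimental: free to change until officially declared public."""
    return list(data)
```

With something like this, documentation and release tooling could list
exactly which functions are covered by the versioning contract, and
callers of the experimental ones get a warning at the call site.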

Julian


> On Jul 26, 2017, at 12:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>
> I see the semantic versioning like this:
>
> Major version: Format and Metadata stability
> Minor version: API stability within fix versions
> Fix version: Bug fixes
>
> So an API might be deprecated from 1.0.0 to 1.1.0, but we could not
> make a breaking change to the memory format without increasing the
> major version. We also have the added protection of a version enum in
> the metadata:
>
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
>
> On Wed, Jul 26, 2017 at 2:56 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>> Given the nature of the Arrow project, where any number of different
>> implementations will be in flux at any given time, claiming any sort
>> of API stability at the code level across the whole project seems
>> impossible any time soon.
>>
>> The important commitment of a 1.0 release is that the metadata and
>> memory format are not changing (without a change in the major version
>> number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API" in a sense is the
>> memory format and serialized metadata representation. That is, the
>> files in
>>
>> https://github.com/apache/arrow/tree/master/format
>>
>> Having this kind of stability is really important so that if any
>> systems know how to parse or emit Arrow 1.x data, but aren't
>> necessarily using the libraries provided by the project, they can have
>> some assurance that we aren't going to break the Flatbuffers or the
>> arrangement of bytes in a record batch on the wire. If that makes
>> sense.
>>
>> - Wes
>>
>> On Wed, Jul 26, 2017 at 2:35 PM, Julian Hyde <jhyde@apache.org> wrote:
>>> 1.0 is a Big Deal because, under semantic versioning, there is a
>>> commitment to not change public APIs. If it weren’t for that, 1.0 would
>>> have vague marketing connotations of robustness, adoption etc. but
>>> otherwise be no different from another release.
>>>
>>> So, if API and data format lifecycle and compatibility is the goal
>>> here, would it be useful to introduce explicit flags on API maturity?
>>> Call out which APIs are public, and therefore bound by the semantic
>>> versioning contract. This will also give Arrow some room to add
>>> experimental features after 1.0, and avoid calcification.
>>>
>>> Julian
>>>
>>>
>>>
>>>> On Jul 26, 2017, at 7:40 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>
>>>> I created https://issues.apache.org/jira/browse/ARROW-1277 about
>>>> integration testing remaining data types. We are so close to having
>>>> everything tested and stable, we should push to complete these as soon
>>>> as possible (save for Map, which has only just been added to the
>>>> metadata)
>>>>
>>>> On Mon, Jul 24, 2017 at 5:35 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>> I agree those things would be nice to have. Hardening the memory
>>>>> format details probably would not take longer than a month or so if we
>>>>> were to focus in on it.
>>>>>
>>>>> Formalizing REST / RPC or IPC seems like it will be more work, or will
>>>>> require a design period and then initial implementation. I think
>>>>> having the streaming format implementations is a good start, but the
>>>>> streams are a bit monolithic -- e.g. in REST you might want to request
>>>>> metadata only, or only record batches given a known schema. We should
>>>>> create a proposal document (Google docs?) for the community to comment
>>>>> on, where we can iterate on requirements.
>>>>>
>>>>> Separately, I'm interested in embedding Arrow streams in other
>>>>> transport layers, like gRPC. The recent refactoring in C++ to make the
>>>>> streams less monolithic was intended to help with that.
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Mon, Jul 24, 2017 at 4:01 PM, Jacques Nadeau <jacques@apache.org> wrote:
>>>>>> Top things on my list:
>>>>>>
>>>>>> - Formalize Arrow RPC and/or REST
>>>>>> - Some reference transformation algorithms
>>>>>> - Prototype IPC
>>>>>>
>>>>>> On Mon, Jul 24, 2017 at 9:47 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>>
>>>>>>> hi folks,
>>>>>>>
>>>>>>> In recent discussions, since the Arrow memory format and metadata have
>>>>>>> become reasonably stabilized, and we're more likely to add new data
>>>>>>> types than change existing ones, we may consider making a 1.0.0
>>>>>>> release to declare to the rest of the open source world that "Arrow
>>>>>>> is open for business" and can be relied upon in production
>>>>>>> applications (with some reasonable tolerance for library API changes
>>>>>>> from major release to major release). I hope we can all agree that
>>>>>>> forward and backward compatibility in the zero-copy wire format and
>>>>>>> metadata is the most essential thing.
>>>>>>>
>>>>>>> To that end, I'd like to collect ideas for what needs to be
>>>>>>> accomplished in the project before we'd be comfortable making a 1.0.0
>>>>>>> release. I think it would be a good show of project stability /
>>>>>>> production-readiness to do this (with the caveat that the APIs will
>>>>>>> continue to evolve).
>>>>>>>
>>>>>>> The main things on my end are hardening the memory format and
>>>>>>> integration tests for the remaining data types:
>>>>>>>
>>>>>>> - Decimals
>>>>>>>   - Lingering issues with 128-bit decimals
>>>>>>>   - Need integration tests
>>>>>>> - Fixed size list
>>>>>>>   - Java has implemented it, but not C++. Need integration tests
>>>>>>> - Union
>>>>>>>   - Two kinds of unions, Java only implements one. Need integration tests
>>>>>>>
>>>>>>> On these, Decimals need the most work since the memory format needs to
>>>>>>> be specified. On Unions, we may decide to not implement the dense
>>>>>>> variant and focus on integration testing the sparse variant. I don't
>>>>>>> think this is going to be too much work, but it needs to get sorted
>>>>>>> out so we don't have incomplete or under-tested parts of the
>>>>>>> specification.
>>>>>>>
>>>>>>> There are some other things being discussed, like a Map logical type,
>>>>>>> but that (at least as currently proposed) won't require any disruptive
>>>>>>> modifications to the metadata.
>>>>>>>
>>>>>>> As for the metadata and memory format, we would use the Open/Closed
>>>>>>> principle to guide our efforts
>>>>>>> (https://en.wikipedia.org/wiki/Open/closed_principle). For example, it
>>>>>>> would be possible to add compression or encoding at the field level
>>>>>>> without disrupting earlier versions of the software that lack these
>>>>>>> features.
>>>>>>>
>>>>>>> In the event that we do need to change the metadata or memory format
>>>>>>> in the future (which would probably be an extreme circumstance), we
>>>>>>> have the option of increasing the MetadataVersion which is one of the
>>>>>>> first tags accompanying Arrow messages
>>>>>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
>>>>>>> So if you encounter a message that you do not support, you can raise
>>>>>>> an appropriate exception.
>>>>>>>
>>>>>>> There are some other things that would be nice to prototype or
>>>>>>> specify, like a REST protocol for exposing Arrow datasets in a
>>>>>>> client-server model (sending Arrow record batches via REST HTTP
>>>>>>> calls).
>>>>>>>
>>>>>>> Anything else that would need to go in to move to a 1.x mainline for
>>>>>>> development? One idea would be that, if we need to make any breaking
>>>>>>> changes, we would leap from 1.x to 2.0.0 and throw the 1.x branches
>>>>>>> into maintenance mode.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wes
>>>>>>>
>>>