arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0
Date Wed, 26 Jul 2017 19:02:52 GMT
I see the semantic versioning like this:

Major version: Format and Metadata stability
Minor version: API stability within fix versions
Fix version: Bug fixes

So an API might be deprecated from 1.0.0 to 1.1.0, but we could not
make a breaking change to the memory format without increasing the
major version. We also have the added protection of a version enum in
the metadata:

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
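
As a rough illustration of how a reader might rely on both protections, here is a minimal Python sketch. It is not code from any Arrow library: the enum members and their values, the exception class, the set of supported versions, and both function names are assumptions made up for this example. It only shows the idea of rejecting a message whose metadata version is not understood, and of treating releases with the same major version as sharing a format.

```python
from enum import IntEnum


class MetadataVersion(IntEnum):
    """Illustrative stand-in for the version enum in format/Schema.fbs.

    The member names and values here are assumptions for this sketch.
    """
    V1 = 0
    V2 = 1
    V3 = 2


class UnsupportedMetadataVersion(Exception):
    """Raised when a message carries a metadata version we cannot read."""


# Assumption: this hypothetical reader only understands V3.
SUPPORTED_VERSIONS = frozenset({MetadataVersion.V3})


def check_metadata_version(raw_version: int) -> MetadataVersion:
    """Return the parsed version tag, or raise if we do not support it."""
    try:
        version = MetadataVersion(raw_version)
    except ValueError:
        # The tag is newer (or otherwise unknown) to this reader.
        raise UnsupportedMetadataVersion(
            f"unknown metadata version tag: {raw_version}"
        ) from None
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedMetadataVersion(
            f"unsupported metadata version: {version.name}"
        )
    return version


def same_format(reader_release: str, writer_release: str) -> bool:
    """True when two 'major.minor.fix' release strings share a major version."""
    return reader_release.split(".")[0] == writer_release.split(".")[0]
```

Under this scheme, `same_format("1.0.0", "1.1.0")` holds while `same_format("1.4.2", "2.0.0")` does not, and a message tagged with an unsupported version fails early with a clear exception instead of being misread.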

On Wed, Jul 26, 2017 at 2:56 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> Given the nature of the Arrow project, where any number of different
> implementations will be in flux at any given time, claiming any sort
> of API stability at the code level across the whole project seems
> impossible any time soon.
>
> The important commitment of a 1.0 release is that the metadata and
> memory format are not changing (without a change in the major version
> number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API" in a sense is the
> memory format and serialized metadata representation. That is, the
> files in
>
> https://github.com/apache/arrow/tree/master/format
>
> Having this kind of stability is really important so that if any
> systems know how to parse or emit Arrow 1.x data, but aren't
> necessarily using the libraries provided by the project, they can have
> some assurance that we aren't going to break the Flatbuffers or the
> arrangement of bytes in a record batch on the wire. If that makes
> sense.
>
> - Wes
>
> On Wed, Jul 26, 2017 at 2:35 PM, Julian Hyde <jhyde@apache.org> wrote:
>> 1.0 is a Big Deal because, under semantic versioning, there is a
>> commitment to not change public APIs. If it weren’t for that, 1.0 would
>> have vague marketing connotations of robustness, adoption etc. but
>> otherwise be no different from another release.
>>
>> So, if API and data format lifecycle and compatibility is the goal here,
>> would it be useful to introduce explicit flags on API maturity? Call out
>> which APIs are public, and therefore bound by the semantic versioning
>> contract. This will also give Arrow some room to add experimental
>> features after 1.0, and avoid calcification.
>>
>> Julian
>>
>>
>>
>>> On Jul 26, 2017, at 7:40 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>
>>> I created https://issues.apache.org/jira/browse/ARROW-1277 about
>>> integration testing remaining data types. We are so close to having
>>> everything tested and stable, we should push to complete these as soon
>>> as possible (save for Map, which has only just been added to the
>>> metadata)
>>>
>>> On Mon, Jul 24, 2017 at 5:35 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>> I agree those things would be nice to have. Hardening the memory
>>>> format details probably would not take longer than a month or so if we
>>>> were to focus in on it.
>>>>
>>>> Formalizing REST / RPC or IPC seems like it will be more work, or will
>>>> require a design period and then initial implementation. I think
>>>> having the streaming format implementations is a good start, but the
>>>> streams are a bit monolithic -- e.g. in REST you might want to request
>>>> metadata only, or only record batches given a known schema. We should
>>>> create a proposal document (Google docs?) for the community to comment
>>>> where we can iterate on requirements
>>>>
>>>> Separately, I'm interested in embedding Arrow streams in other
>>>> transport layers, like GRPC. The recent refactoring in C++ to make the
>>>> streams less monolithic was intended to help with that.
>>>>
>>>> - Wes
>>>>
>>>> On Mon, Jul 24, 2017 at 4:01 PM, Jacques Nadeau <jacques@apache.org> wrote:
>>>>> Top things on my list:
>>>>>
>>>>> - Formalize Arrow RPC and/or REST
>>>>> - Some reference transformation algorithms
>>>>> - Prototype IPC
>>>>>
>>>>> On Mon, Jul 24, 2017 at 9:47 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>
>>>>>> hi folks,
>>>>>>
>>>>>> In recent discussions, since the Arrow memory format and metadata
>>>>>> have become reasonably stable, and we're more likely to add new data
>>>>>> types than change existing ones, we may consider making a 1.0.0
>>>>>> release to declare to the rest of the open source world that "Arrow
>>>>>> is open for business" and can be relied upon in production
>>>>>> applications (with some reasonable tolerance for library API changes
>>>>>> from major release to major release). I hope we can all agree that
>>>>>> forward and backward compatibility in the zero-copy wire format and
>>>>>> metadata is the most essential thing.
>>>>>>
>>>>>> To that end, I'd like to collect ideas for what needs to be
>>>>>> accomplished in the project before we'd be comfortable making a 1.0.0
>>>>>> release. I think it would be a good show of project stability /
>>>>>> production-readiness to do this (with the caveat that the APIs will
>>>>>> continue to evolve).
>>>>>>
>>>>>> The main things on my end are hardening the memory format and
>>>>>> integration tests for the remaining data types:
>>>>>>
>>>>>> - Decimals
>>>>>>    - Lingering issues with 128-bit decimals
>>>>>>    - Need integration tests
>>>>>> - Fixed size list
>>>>>>    - Java has implemented, but not C++. Need integration tests
>>>>>> - Union
>>>>>>    - Two kinds of unions, Java only implements one. Need integration
>>>>>>      tests
>>>>>>
>>>>>> On these, Decimals have the most work since the memory format needs
>>>>>> to be specified. On Unions, we may decide to not implement the dense
>>>>>> variant and focus on integration testing the sparse variant. I don't
>>>>>> think this is going to be too much work, but it needs to get sorted
>>>>>> out so we don't have incomplete or under-tested parts of the
>>>>>> specification.
>>>>>>
>>>>>> There are some other things being discussed, like a Map logical type,
>>>>>> but that (at least as currently proposed) won't require any disruptive
>>>>>> modifications to the metadata.
>>>>>>
>>>>>> As far as the metadata and memory format, we would use the Open/Closed
>>>>>> principle to guide our efforts
>>>>>> (https://en.wikipedia.org/wiki/Open/closed_principle). For example,
>>>>>> it would be possible to add compression or encoding at the field level
>>>>>> without disrupting earlier versions of the software that lack these
>>>>>> features.
>>>>>>
>>>>>> In the event that we do need to change the metadata or memory format
>>>>>> in the future (which would probably be an extreme circumstance), we
>>>>>> have the option of increasing the MetadataVersion which is one of the
>>>>>> first tags accompanying Arrow messages
>>>>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
>>>>>> So if you encounter a message that you do not support, you can raise
>>>>>> an appropriate exception.
>>>>>>
>>>>>> There are some other things that would be nice to prototype or
>>>>>> specify, like a REST protocol for exposing Arrow datasets in a
>>>>>> client-server model (sending Arrow record batches via REST HTTP
>>>>>> calls).
>>>>>>
>>>>>> Anything else that would need to happen before we move to a 1.x
>>>>>> mainline for development? One idea: if we need to make any breaking
>>>>>> changes, we would leap from 1.x to 2.0.0 and put the 1.x branches
>>>>>> into maintenance mode.
>>>>>>
>>>>>> Thanks
>>>>>> Wes
>>>>>>
>>
