arrow-dev mailing list archives

From Uwe Korn <uw...@xhochy.com>
Subject Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)
Date Wed, 07 Sep 2016 04:51:15 GMT
Hello,

I'm also in favour of switching the dependency direction between Parquet 
and Arrow, as this would avoid a lot of duplicate code in both projects 
and let parquet-cpp profit from functionality that is available in 
Arrow.

@wesm: go ahead with the JIRAs and I'll add comments or will pick some 
of them up.

Cheers

Uwe


On 07.09.16 04:41, Wes McKinney wrote:
> hi Julien,
>
> It makes sense to move the Parquet support for Arrow into Parquet
> itself and invert the dependency. I had thought that the coupling to
> Arrow C++'s IO subsystem might be tighter, but the connection between
> memory allocators and file abstractions is fairly simple:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h
>
> I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring.
>
> The exposure of the Parquet functionality in Python should stay inside
> Arrow for now, but mainly because it would make developing the Python
> side of things much more difficult if we split things up right now.
>
> - Wes
>
> On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman <Brian.Bowman@sas.com> wrote:
>> Forgive me if interposing my first post for the Apache Arrow project on this
>> thread is incorrect procedure.
>>
>> What Julien proposes with each storage layer producing Arrow Record Batches is
>> exactly how I envision it working and would certainly make Arrow integration
>> with SAS much more palatable. This is likely true for other storage layer
>> providers as well.
>>
>> Brian Bowman (SAS)
>>
>>> On Sep 6, 2016, at 7:52 PM, Julien Le Dem <julien@dremio.com> wrote:
>>>
>>> Thanks Wes,
>>> No worries, I know you are on top of those things.
>>> On a side note, I was wondering if the arrow-parquet integration should be
>>> in Parquet instead.
>>> Parquet would depend on Arrow and not the other way around.
>>> Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra,
>>> ...) provides a way to produce Arrow Record Batches.
>>> thoughts?
>>>
>>>> On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>
>>>> hi Julien,
>>>>
>>>> I'm very sorry about the inconvenience with this and the delay in
>>>> getting it sorted out. I will triage this evening by disabling the
>>>> Parquet tests in Arrow until we get the current problems under
>>>> control. When we re-enable the Parquet tests in Travis CI I agree we
>>>> should pin the version SHA.
>>>>
>>>> - Wes
>>>>
>>>>> On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem <julien@dremio.com> wrote:
>>>>> The Arrow cpp travis-ci build is broken right now because it depends on
>>>>> parquet-cpp which has changed in an incompatible way. [1] [2] (or so it
>>>>> looks to me)
>>>>> Since parquet-cpp is not released yet it is totally fine to make
>>>>> incompatible API changes.
>>>>> However, we may want to pin the Arrow to Parquet dependency (on a git sha?)
>>>>> to prevent cross project changes from breaking the master build.
>>>>> Since I'm not one of the core cpp dev on those projects I mainly want to
>>>>> start that conversation rather than prescribe a solution. Feel free to take
>>>>> this as a straw man and suggest something else.
>>>>>
>>>>> [1] https://travis-ci.org/apache/arrow/jobs/156080555
>>>>> [2] https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d5af150dd31/ci/travis_before_script_cpp.sh
>>>>>
>>>>> --
>>>>> Julien
>>>
>>>
>>> --
>>> Julien

