drill-dev mailing list archives

From Jake Luciani <j...@apache.org>
Subject Re: Naming the new ValueVector Initiative
Date Thu, 21 Jan 2016 01:21:45 GMT
That's great! So it's going straight to TLP?

Hey Everyone,

Good news! The Apache board has approved Apache Arrow as a new TLP.
I've asked the Apache INFRA team to set up the required resources so we can
start moving forward (mailing lists, Git, website, etc.).

I've started working on a press release to announce the Apache Arrow
project and will circulate a draft shortly. Once the project mailing lists
are established, we can move this thread over there to continue
discussions. They had us make one change to the proposal during the board
call, which was to remove the initial committers (as separate from the
initial PMC). Once we establish the PMC list, we can immediately add the
additional committers as our first PMC action.

Thanks to everyone!
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <julien@dremio.com> wrote:

> +1 on a repo for the spec.
> I do have questions as well.
> In particular for the metadata.
>
> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <wes@cloudera.com> wrote:
>
>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <parthc@apache.org> wrote:
>> >
>> >
>> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <wes@cloudera.com> wrote:
>> >>
>> >>
>> >> >
>> >> > As far as the existing work is concerned, I'm not sure everyone is
>> >> > aware of the C++ code inside of Drill that can represent at least the
>> >> > scalar types in Drill's existing Value Vectors [1]. This is currently
>> >> > used by the native client written to hook up an ODBC driver.
>> >> >
>> >>
>> >> I have read this code. From my perspective, it would be less work to
>> >> collaborate on a self-contained implementation that closely models the
>> >> Arrow / VV spec, includes builder classes and its own memory
>> >> management, and has no coupling to Drill details. I started prototyping
>> >> something here (warning: only a few actual days of coding here):
>> >>
>> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>> >>
>> >> For example, you can see how an Array<Int32> or String (== Array<UInt8>)
>> >> column is constructed in the tests here:
>> >>
>> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
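(For illustration only: a minimal, self-contained sketch of the builder
pattern described above - append values and nulls into a builder, then
finish into an immutable array carrying per-slot validity and a null count.
The names Int32Builder and Int32Array are hypothetical and are not taken
from the arrow-cpp prototype linked above.)

    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <vector>

    // Hypothetical immutable array: values plus a per-slot validity flag.
    struct Int32Array {
      std::vector<int32_t> values;
      std::vector<bool> valid;   // true = slot holds a value, false = null
      int64_t null_count = 0;
    };

    // Hypothetical builder: accumulates values/nulls, then produces the array.
    class Int32Builder {
     public:
      void Append(int32_t v) {
        values_.push_back(v);
        valid_.push_back(true);
      }
      void AppendNull() {
        values_.push_back(0);    // placeholder for the null slot
        valid_.push_back(false);
        ++null_count_;
      }
      std::shared_ptr<Int32Array> Finish() {
        auto out = std::make_shared<Int32Array>();
        out->values = std::move(values_);
        out->valid = std::move(valid_);
        out->null_count = null_count_;
        return out;
      }
     private:
      std::vector<int32_t> values_;
      std::vector<bool> valid_;
      int64_t null_count_ = 0;
    };

    int main() {
      Int32Builder builder;
      builder.Append(1);
      builder.AppendNull();
      builder.Append(3);
      auto array = builder.Finish();
      std::cout << "length=" << array->values.size()
                << " nulls=" << array->null_count << "\n";
      return 0;
    }

(A real implementation would use a packed validity bitmap and explicit
memory management rather than std::vector, but the builder/array split is
the idea being discussed.)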
>> >>
>> >> I've been planning to use this as the basis of a C++ Parquet
>> >> reader-writer and the associated Python pandas-like layer which
>> >> includes in-memory analytics on Arrow data structures.
>> >>
>> >> > Parth, who is included here, has been the primary owner of this C++
>> >> > code throughout its life in Drill. Parth, what do you think is the
>> >> > best strategy for managing the C++ code right now? As the C++ build
>> >> > is not tied into the Java one, as I understand it we just run it
>> >> > manually when updates are made there and we need to update ODBC.
>> >> > Would it be disruptive to move the code to the arrow repo? If so, we
>> >> > could include Drill as a submodule in the new repo, or put Wes's work
>> >> > so far in the Drill repo.
>> >>
>> >> If we can enumerate the non-Drill-client-related parts (i.e. the array
>> >> accessors and data-structure-oriented code) that would make sense in a
>> >> standalone Arrow library, it would be great to start a side discussion
>> >> about the design of the C++ reference implementation (metadata /
>> >> schemas, IPC, array builders and accessors, etc.). Since this is quite
>> >> urgent for me (I intend to deliver a minimally viable pandas-like Arrow
>> >> + Parquet in Python stack in the next ~3 months), it would be great to
>> >> do this sooner rather than later.
>> >>
>> >
>> > Most of the code for Drill's C++ Value Vectors is independent of Drill -
>> > mostly the code up to line 787 in this file:
>> >
>> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>> >
>> > My thought was to leave the Drill implementation alone and borrow
>> > copiously from it when convenient for Arrow. Seems like we can still do
>> > that building on Wes's work.
>> >
>>
>> Makes sense. Speaking of code, would you all like me to set up a
>> temporary repo for the specification itself? I already have a few
>> questions like how and where to track array null counts.
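(For illustration only: one way the null-count question could resolve is to
derive the count from a per-array validity bitmap and cache it alongside the
array metadata. The bitmap convention below - bit i set means slot i holds a
value - is an assumption for this sketch, not something decided in this
thread.)

    #include <cstdint>
    #include <vector>

    // Hypothetical helper: count nulls from a validity bitmap in which
    // bit i == 1 means "slot i is valid". The bitmap is assumed to hold
    // at least (length + 7) / 8 bytes.
    int64_t CountNulls(const std::vector<uint8_t>& validity_bitmap,
                       int64_t length) {
      int64_t valid = 0;
      for (int64_t i = 0; i < length; ++i) {
        valid += (validity_bitmap[i / 8] >> (i % 8)) & 1;
      }
      return length - valid;
    }

(Whether that count lives with the buffers or in separate schema/metadata is
exactly the kind of question a specification repo could settle.)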
>>
>> > Wes, let me know if you want to have a quick hangout on this.
>> >
>>
>> Sure, I'll follow up separately to get something on the calendar.
>> Looking forward to connecting!
>>
>> > Parth
>> >
>> >
>>
>
>
>
> --
> Julien
>
