drill-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: Naming the new ValueVector Initiative
Date Thu, 21 Jan 2016 22:00:27 GMT
To expand on what “straight to TLP” means (correct me if I’m wrong, Jacques):

From an IP standpoint, the new project is a clone of Drill. It starts off with Drill’s code
base. We then, as the sculptor said [1], chip away everything that doesn’t look like Arrow.

Julian

[1] http://quoteinvestigator.com/2014/06/22/chip-away/

> On Jan 20, 2016, at 7:15 PM, Jacques Nadeau <jacques@dremio.com> wrote:
> 
> Yep, straight to TLP.
> 
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
> 
> On Wed, Jan 20, 2016 at 5:21 PM, Jake Luciani <jake@apache.org> wrote:
> 
>> That's great! So it's going straight to TLP?
>> Hey Everyone,
>> 
>> Good news! The Apache board has approved Apache Arrow as a new TLP.
>> I've asked the Apache INFRA team to set up required resources so we can
>> start moving forward (ML, Git, Website, etc).
>> 
>> I've started working on a press release to announce the Apache Arrow
>> project and will circulate a draft shortly. Once the project mailing lists
>> are established, we can move this thread over there to continue
>> discussions. They had us make one change to the proposal during the board
>> call, which was to remove the initial committers (separate from initial
>> PMC). Once we establish the PMC list, we can immediately add the additional
>> committers as our first PMC action.
>> 
>> thanks to everyone!
>> Jacques
>> 
>> 
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>> 
>> On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <julien@dremio.com> wrote:
>> 
>>> +1 on a repo for the spec.
>>> I do have questions as well.
>>> In particular for the metadata.
>>> 
>>> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <wes@cloudera.com> wrote:
>>> 
>>>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <parthc@apache.org> wrote:
>>>>> 
>>>>> 
>>>>> On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <wes@cloudera.com> wrote:
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> As far as the existing work is concerned, I'm not sure everyone is aware
>>>>>>> of the C++ code inside of Drill that can represent at least the scalar
>>>>>>> types in Drill's existing Value Vectors [1]. This is currently used by the
>>>>>>> native client written to hook up an ODBC driver.
>>>>>>> 
>>>>>> 
>>>>>> I have read this code. From my perspective, it would be less work to
>>>>>> collaborate on a self-contained implementation that closely models the
>>>>>> Arrow / VV spec that includes builder classes and its own memory
>>>>>> management without coupling to Drill details. I started prototyping
>>>>>> something here (warning: only a few actual days of coding here):
>>>>>> 
>>>>>> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>>>>> 
>>>>>> For example, the tests here construct an Array<Int32> or
>>>>>> String (== Array<UInt8>) column:
>>>>>>
>>>>>> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>>>>> 
>>>>>> I've been planning to use this as the basis of a C++ Parquet
>>>>>> reader-writer and the associated Python pandas-like layer which
>>>>>> includes in-memory analytics on Arrow data structures.
>>>>>> 
>>>>>>> Parth, who is included here, has been the primary owner of this C++ code
>>>>>>> throughout its life in Drill. Parth, what do you think is the best
>>>>>>> strategy for managing the C++ code right now? As the C++ build is not tied
>>>>>>> into the Java one, as I understand it we just run it manually when updates
>>>>>>> are made there and we need to update ODBC. Would it be disruptive to move
>>>>>>> the code to the arrow repo? If so, we could include Drill as a submodule
>>>>>>> in the new repo, or put Wes's work so far in the Drill repo.
>>>>>> 
>>>>>> If we can enumerate the non-Drill-client related parts (i.e. the array
>>>>>> accessors and data structures-oriented code) that would make sense in
>>>>>> a standalone Arrow library, it would be great to start a side
>>>>>> discussion about the design of the C++ reference implementation
>>>>>> (metadata / schemas, IPC, array builders and accessors, etc.). Since
>>>>>> this is quite urgent for me (intending to deliver a minimally viable
>>>>>> pandas-like Arrow + Parquet in Python stack in the next ~3 months), it
>>>>>> would be great to do this sooner rather than later.
>>>>>> 
>>>>> 
>>>>> Most of the code for Drill C++ Value Vectors is independent of Drill -
>>>>> mostly the code up to line 787 in this file -
>>>>>
>>>>> https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>>>>>
>>>>> My thought was to leave the Drill implementation alone and borrow copiously
>>>>> from it when convenient for Arrow. Seems like we can still do that building
>>>>> on Wes' work.
>>>>> 
>>>> 
>>>> Makes sense. Speaking of code, would you all like me to set up a
>>>> temporary repo for the specification itself? I already have a few
>>>> questions like how and where to track array null counts.
>>>> 
>>>>> Wes, let me know if you want to have a quick hangout on this.
>>>>> 
>>>> 
>>>> Sure, I'll follow up separately to get something on the calendar.
>>>> Looking forward to connecting!
>>>> 
>>>>> Parth
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Julien
>>> 
>> 
>> 

