drill-dev mailing list archives

From Parth Chandra <par...@apache.org>
Subject Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project
Date Mon, 26 Oct 2015 21:46:45 GMT
+1. Agree with Hanifi that we probably should have done this sooner :).
Jason and I faced this need when trying to get a standalone vectorized
Parquet reader out of the Drill code last year.



On Mon, Oct 26, 2015 at 2:37 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:

> I was hoping to see this discussion happening sooner :) VVs have helped
> Drill represent and move data around so flexibly that it would not be
> hard to prove their usefulness to the community as a standalone library. I am
> in support of this proposal.
>
>
> -Hanifi
>
> On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>
> > Drillers,
> >
> >
> >
> > A number of people have approached me recently about the possibility of
> > collaborating on a shared columnar in-memory representation of data. This
> > shared representation of data could be operated on efficiently with modern
> > CPUs as well as shared efficiently via shared memory, IPC and RPC. This
> > would allow multiple applications to work together at high speed. Examples
> > include moving back and forth between a library.
> >
> >
> >
> > As I was discussing these ideas with people working on projects including
> > Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from companies
> > like MapR and Trifacta, it became clear that much of what the Drill
> > community has already constructed is very relevant to the goals of a new
> > broader interchange and execution format. (In fact, Ted and I actually
> > informally discussed extracting this functionality as a library more than
> > two years ago.)
> >
> >
> >
> > A standard will emerge around this need and it is in the best interest of
> > the Drill community and the broader ecosystem if Drill’s ValueVector
> > concepts and code form the basis of a new library/collaboration/project.
> > This means better interoperability, shared responsibility around
> > maintenance and development and the avoidance of further division of the
> > ecosystem.
> >
> >
> >
> > A little background for some: Drill is the first project to create a
> > powerful language-agnostic in-memory representation of complex columnar
> > data. We've learned a lot over the last three years about how to interface
> > with these structures, manage memory associated with them, adjust their
> > sizes, expose them in builder patterns, etc. That work is useful for a
> > number of systems and it would be great if we could share the learning. By
> > creating a new, well documented and collaborative library, people could
> > leverage this functionality in a wider range of applications and systems.
> >
> >
> >
> > I’ve seen the great success that libraries like Parquet and Calcite have
> > been able to achieve due to their focus on APIs, extensibility and
> > reusability and I think we could do the same with the Drill ValueVector
> > codebase. The fact that this would allow higher speed interchange among
> > many other systems and could become the standard for in-memory columnar
> > exchange (as opposed to having to adopt an external standard) makes this a
> > great opportunity to both benefit the Drill community and give back to the
> > broader Apache community.
> >
> >
> >
> > As such, I’d like to open a discussion about taking this path. I think
> > there are various ways to do this, but my initial proposal would be to
> > set this up as a new project that goes straight to a provisional TLP.
> > We then would work to clean up layer responsibilities and extract
> > pieces of the code into this new project where we collaborate with a
> > wider group on a broader implementation (and more formal specification).
> >
> >
> > Given the conversations I have had and the excitement and need for this, I
> > think we should do this. If the community is supportive, we could probably
> > see some really cool integrations around things like high-speed Python
> > machine learning inside Drill operators before the end of the year.
> >
> >
> >
> > I’ll open a new JIRA and attach it here where we can start a POC &
> > discussion of how we could extract this code.
> >
> >
> > Looking forward to feedback!
> >
> >
> > Jacques
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
>
