drill-dev mailing list archives

From Julien Le Dem <jul...@dremio.com>
Subject Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project
Date Mon, 26 Oct 2015 21:50:33 GMT
+1, looking forward to vectorized Parquet Readers/Writers in Drill.
Making VV a standalone standard sounds great to me.

On Mon, Oct 26, 2015 at 2:46 PM, Parth Chandra <parthc@apache.org> wrote:

> +1. Agree with Hanifi that we probably should have done this sooner :).
> Jason and I faced this need when trying to get a stand alone vectorized
> parquet reader out of the Drill code last year.
>
>
>
> On Mon, Oct 26, 2015 at 2:37 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
>
> > I was hoping to see this discussion happening sooner :) VVs have helped
> > Drill represent and move data around so flexibly that it would not be
> > hard to prove their usefulness to the community as a standalone library.
> > I am in support of this proposal.
> >
> >
> > -Hanifi
> >
> > On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <jacques@dremio.com>
> > wrote:
> >
> > > Drillers,
> > >
> > >
> > >
> > > > A number of people have approached me recently about the possibility of
> > > > collaborating on a shared columnar in-memory representation of data.
> > > > This shared representation of data could be operated on efficiently with
> > > > modern CPUs as well as shared efficiently via shared memory, IPC and RPC.
> > > > This would allow multiple applications to work together at high speed.
> > > > Examples include moving back and forth between a library.
> > >
> > >
> > >
> > > > As I was discussing these ideas with people working on projects
> > > > including Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from
> > > > companies like MapR and Trifacta, it became clear that much of what the
> > > > Drill community has already constructed is very relevant to the goals of
> > > > a new broader interchange and execution format. (In fact, Ted and I
> > > > actually informally discussed extracting this functionality as a library
> > > > more than two years ago.)
> > >
> > >
> > >
> > > > A standard will emerge around this need and it is in the best interest
> > > > of the Drill community and the broader ecosystem if Drill’s ValueVectors
> > > > concepts and code form the basis of a new library/collaboration/project.
> > > > This means better interoperability, shared responsibility around
> > > > maintenance and development and the avoidance of further division of
> > > > the ecosystem.
> > >
> > >
> > >
> > > > A little background for some: Drill is the first project to create a
> > > > powerful language-agnostic in-memory representation of complex columnar
> > > > data. We've learned a lot over the last three years about how to
> > > > interface with these structures, manage memory associated with them,
> > > > adjust their sizes, expose them in builder patterns, etc. That work is
> > > > useful for a number of systems and it would be great if we could share
> > > > the learning. By creating a new, well-documented and collaborative
> > > > library, people could leverage this functionality in a wider range of
> > > > applications and systems.
> > >
> > >
> > >
> > > > I’ve seen the great success that libraries like Parquet and Calcite
> > > > have been able to achieve due to their focus on APIs, extensibility and
> > > > reusability and I think we could do the same with the Drill ValueVector
> > > > codebase. The fact that this would allow higher speed interchange among
> > > > many other systems and become the standard for in-memory columnar
> > > > exchange (as opposed to having to adopt an external standard) makes
> > > > this a great opportunity to both benefit the Drill community and give
> > > > back to the broader Apache community.
> > >
> > >
> > >
> > > > As such, I’d like to open a discussion about taking this path. I think
> > > > there would be various avenues of how to do this but my initial
> > > > proposal would be to propose this as a new project that goes straight to
> > > > a provisional TLP. We then would work to clean up layer responsibilities
> > > > and extract pieces of the code into this new project where we
> > > > collaborate with a wider group on a broader implementation (and more
> > > > formal specification).
> > >
> > >
> > > > Given the conversations I have had and the excitement and need for
> > > > this, I think we should do this. If the community is supportive, we
> > > > could probably see some really cool integrations around things like
> > > > high-speed Python machine learning inside Drill operators before the end
> > > > of the year.
> > >
> > >
> > >
> > > I’ll open a new JIRA and attach it here where we can start a POC &
> > > discussion of how we could extract this code.
> > >
> > >
> > > Looking forward to feedback!
> > >
> > >
> > > Jacques
> > >
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> >
>



-- 
Julien
