drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project
Date Mon, 26 Oct 2015 22:35:58 GMT
This sounds like a really good idea to me.



On Mon, Oct 26, 2015 at 2:50 PM, Julien Le Dem <julien@dremio.com> wrote:

> +1, looking forward to vectorized Parquet Readers/Writers in Drill.
> Making VV a standalone standard sounds great to me.
>
> On Mon, Oct 26, 2015 at 2:46 PM, Parth Chandra <parthc@apache.org> wrote:
>
> > +1. Agree with Hanifi that we probably should have done this sooner :).
> > Jason and I faced this need when trying to get a stand alone vectorized
> > parquet reader out of the Drill code last year.
> >
> >
> >
> > On Mon, Oct 26, 2015 at 2:37 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
> >
> > > I was hoping to see this discussion happening sooner :) VVs have helped
> > > Drill represent and move data around so flexibly that it would not be
> > > hard to prove their usefulness to the community as a standalone library.
> > > I am in support of this proposal.
> > >
> > >
> > > -Hanifi
> > >
> > > On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <jacques@dremio.com>
> > > wrote:
> > >
> > > > Drillers,
> > > >
> > > >
> > > >
> > > > A number of people have approached me recently about the possibility of
> > > > collaborating on a shared columnar in-memory representation of data. This
> > > > shared representation of data could be operated on efficiently with modern
> > > > CPUs as well as shared efficiently via shared memory, IPC and RPC. This
> > > > would allow multiple applications to work together at high speed. Examples
> > > > include moving back and forth between a library.
> > > >
> > > >
> > > >
> > > > As I was discussing these ideas with people working on projects including
> > > > Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from companies
> > > > like MapR and Trifacta, it became clear that much of what the Drill
> > > > community has already constructed is very relevant to the goals of a new
> > > > broader interchange and execution format. (In fact, Ted and I actually
> > > > informally discussed extracting this functionality as a library more than
> > > > two years ago.)
> > > >
> > > >
> > > >
> > > > A standard will emerge around this need and it is in the best interest of
> > > > the Drill community and the broader ecosystem if Drill’s ValueVector
> > > > concepts and code form the basis of a new library/collaboration/project.
> > > > This means better interoperability, shared responsibility around
> > > > maintenance and development and the avoidance of further division of the
> > > > ecosystem.
> > > >
> > > >
> > > >
> > > > A little background for some: Drill is the first project to create a
> > > > powerful language-agnostic in-memory representation of complex columnar
> > > > data. We've learned a lot over the last three years about how to interface
> > > > with these structures, manage memory associated with them, adjust their
> > > > sizes, expose them in builder patterns, etc. That work is useful for a
> > > > number of systems and it would be great if we could share the learning. By
> > > > creating a new, well-documented and collaborative library, people could
> > > > leverage this functionality in a wider range of applications and systems.
> > > >
> > > >
> > > >
> > > > I’ve seen the great success that libraries like Parquet and Calcite have
> > > > been able to achieve due to their focus on APIs, extensibility and
> > > > reusability and I think we could do the same with the Drill ValueVector
> > > > codebase. The fact that this would allow higher speed interchange among
> > > > many other systems and become the standard for in-memory columnar
> > > > exchange (as opposed to having to adopt an external standard) makes this a
> > > > great opportunity to both benefit the Drill community and give back to the
> > > > broader Apache community.
> > > >
> > > >
> > > >
> > > > As such, I’d like to open a discussion about taking this path. I think
> > > > there would be various avenues of how to do this but my initial proposal
> > > > would be to propose this as a new project that goes straight to a
> > > > provisional TLP. We then would work to clean up layer responsibilities and
> > > > extract pieces of the code into this new project where we collaborate with
> > > > a wider group on a broader implementation (and more formal specification).
> > > >
> > > >
> > > > Given the conversations I have had and the excitement and need for this, I
> > > > think we should do this. If the community is supportive, we could probably
> > > > see some really cool integrations around things like high-speed Python
> > > > machine learning inside Drill operators before the end of the year.
> > > >
> > > >
> > > >
> > > > I’ll open a new JIRA and attach it here where we can start a POC &
> > > > discussion of how we could extract this code.
> > > >
> > > >
> > > > Looking forward to feedback!
> > > >
> > > >
> > > > Jacques
> > > >
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > >
> >
>
>
>
> --
> Julien
>
