drill-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project
Date Mon, 26 Oct 2015 21:19:44 GMT
Drillers,



A number of people have approached me recently about the possibility of
collaborating on a shared columnar in-memory representation of data. This
shared representation of data could be operated on efficiently with modern
CPUs as well as shared efficiently via shared memory, IPC and RPC. This
would allow multiple applications to work together at high speed. Examples
include moving back and forth between a library.



As I was discussing these ideas with people working on projects including
Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from companies
like MapR and Trifacta, it became clear that much of what the Drill
community has already constructed is very relevant to the goals of a new
broader interchange and execution format. (In fact, Ted and I actually
informally discussed extracting this functionality as a library more than
two years ago.)



A standard will emerge around this need and it is in the best interest of
the Drill community and the broader ecosystem if Drill’s ValueVector
concepts and code form the basis of a new library/collaboration/project.
This means better interoperability, shared responsibility around
maintenance and development and the avoidance of further division of the
ecosystem.



A little background for some: Drill is the first project to create a
powerful language-agnostic in-memory representation of complex columnar
data. We've learned a lot over the last three years about how to interface
with these structures, manage memory associated with them, adjust their
sizes, expose them in builder patterns, etc. That work is useful for a
number of systems and it would be great if we could share the learning. By
creating a new, well-documented and collaborative library, people could
leverage this functionality in a wider range of applications and systems.



I’ve seen the great success that libraries like Parquet and Calcite have
been able to achieve due to their focus on APIs, extensibility and
reusability and I think we could do the same with the Drill ValueVector
codebase. The fact that this would allow higher-speed interchange among
many other systems and could become the standard for in-memory columnar
exchange (as opposed to having to adopt an external standard) makes this a
great opportunity to both benefit the Drill community and give back to the
broader Apache community.



As such, I’d like to open a discussion about taking this path. There are
various ways to do this, but my initial proposal is to create a new
project that goes straight to a provisional TLP. We would then work to
clean up layer responsibilities and
extract pieces of the code into this new project where we collaborate with
a wider group on a broader implementation (and more formal specification).


Given the conversations I have had and the excitement and need for this, I
think we should do this. If the community is supportive, we could probably
see some really cool integrations around things like high-speed Python
machine learning inside Drill operators before the end of the year.



I’ll open a new JIRA, where we can start a POC and discuss how to extract
this code, and link it here.


Looking forward to feedback!


Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio
