Mailing-List: contact general-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@incubator.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAJwFCa2EkFhGvEgaoJV3b+ytroBBw5z=bo+7Otz=qD9WAPPEPA@mail.gmail.com>
References: 
 <CAJwFCa2EkFhGvEgaoJV3b+ytroBBw5z=bo+7Otz=qD9WAPPEPA@mail.gmail.com>
Date: Wed, 8 Aug 2012 15:25:31 -0700
Message-ID: 
 <CACO5Y4xgWuUwwGEk3R5+P9pLydOg1usFbRcnKmJxya3ZDuksgA@mail.gmail.com>
Subject: Re: [PROPOSAL] Drill for the Apache Incubator
From: Chris Douglas <cdouglas@apache.org>
To: general@incubator.apache.org
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

+1 -C

On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> This is a duplicated attempt at sending this message, please ignore the
> previous message if it eventually arrives.  There appears to be a hangup
> sending email from my apache email address via gmail.
>
> Abstract
> ========
> Drill is a distributed system for interactive analysis of large-scale
> datasets, inspired by Google's Dremel (
> http://research.google.com/pubs/pub36632.html).
>
> Proposal
> ========
> Drill is a distributed system for interactive analysis of large-scale
> datasets. Drill is similar to Google's Dremel, with the additional
> flexibility needed to support a broader range of query languages, data
> formats and data sources. It is designed to efficiently process nested
> data. It is a design goal to scale to 10,000 servers or more and to be able
> to process petabytes of data and trillions of records in seconds.
>
> Background
> ==========
> Many organizations have the need to run data-intensive applications,
> including batch processing, stream processing and interactive analysis. In
> recent years open source systems have emerged to address the need for
> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> Apache S4). In 2010 Google published a paper called "Dremel: Interactive
> Analysis of Web-Scale Datasets," describing a scalable system used
> internally for interactive analysis of nested data. No open source project
> has successfully replicated the capabilities of Dremel.
>
> Rationale
> =========
> There is a strong need in the market for low-latency interactive analysis
> of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
> Buffers). This need was identified by Google and addressed internally with
> a system called Dremel.
>
> In recent years open source systems have emerged to address the need for
> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> Apache S4). Apache Hadoop, originally inspired by Google's internal
> MapReduce system, is used by thousands of organizations processing
> large-scale datasets. Apache Hadoop is designed to achieve very high
> throughput, but is not designed to achieve the sub-second latency needed
> for interactive data analysis and exploration. Drill, inspired by Google's
> internal Dremel system, is intended to address this need.
>
> It is worth noting that, as explained by Google in the original paper,
> Dremel complements MapReduce-based computing. Dremel is not intended as a
> replacement for MapReduce and is often used in conjunction with it to
> analyze outputs of MapReduce pipelines or rapidly prototype larger
> computations. Indeed, Dremel and MapReduce are both used by thousands of
> Google employees.
>
> Like Dremel, Drill supports a nested data model with data encoded in a
> number of formats such as JSON, Avro or Protocol Buffers. In many
> organizations nested data is the standard, so supporting a nested data
> model eliminates the need to normalize the data. With that said, flat data
> formats, such as CSV files, are naturally supported as a special case of
> nested data.
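
A minimal sketch of the last point, written in Python and not taken from Drill itself: if a nested record is viewed as a set of dotted field paths, a CSV row is simply a record whose paths all have depth one. The helper name `column_paths` and both sample records are hypothetical, chosen only for illustration.

```python
import csv
import io
import json


def column_paths(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested group: descend and extend the path.
            yield from column_paths(value, path)
        elif isinstance(value, list):
            # Repeated field: every element shares the same path.
            for item in value:
                if isinstance(item, dict):
                    yield from column_paths(item, path)
                else:
                    yield path, item
        else:
            yield path, value


# A nested record, as it might arrive in JSON (hypothetical sample).
nested = json.loads(
    '{"name": "a", "links": {"forward": [20, 40]}, "tags": [{"code": "en"}]}'
)

# A flat record read from CSV (hypothetical sample).
flat = next(csv.DictReader(io.StringIO("name,size\na,3\n")))

nested_leaves = list(column_paths(nested))
flat_leaves = list(column_paths(flat))
```

The same traversal handles both inputs; the flat record just never triggers the nested or repeated branches.
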
>
> The Drill architecture consists of four key components/layers:
> * Query languages: This layer is responsible for parsing the user's query
> and constructing an execution plan.  The initial goal is to support the
> SQL-like language used by Dremel and Google BigQuery (
> https://developers.google.com/bigquery/docs/query-reference), which we call
> DrQL. However, Drill is designed to support other languages and programming
> models, such as the Mongo Query Language (
> http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
> http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
> * Low-latency distributed execution engine: This layer is responsible for
> executing the physical plan. It provides the scalability and fault
> tolerance needed to efficiently query petabytes of data on 10,000 servers.
> Drill's execution engine is based on research in distributed execution
> engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
> storage, and can be extended with additional operators and connectors.
> * Nested data formats: This layer is responsible for supporting various
> data formats. The initial goal is to support the column-based format used
> by Dremel. Drill is designed to support schema-based formats such as
> Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
> formats such as JSON, BSON or YAML. In addition, it is designed to support
> column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
> row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
> particular distinction with Drill is that the execution engine is flexible
> enough to support column-based processing as well as row-based processing.
> This is important because column-based processing can be much more
> efficient when the data is stored in a column-based format, but many large
> data assets are stored in a row-based format that would require conversion
> before use.
> * Scalable data sources: This layer is responsible for supporting various
> data sources. The initial focus is to leverage Hadoop as a data source.
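
The row-based versus column-based distinction in the nested data formats item above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not Drill's execution engine: the function names (`to_columns`, `sum_row_based`, `sum_column_based`) and the sample records are hypothetical.

```python
# Row-based layout: one complete record per entry (hypothetical sample data).
rows = [
    {"user": "a", "bytes": 10},
    {"user": "b", "bytes": 30},
    {"user": "c", "bytes": 2},
]


def to_columns(rows):
    """Convert row-based records into a column-based layout."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_based(rows, field):
    """Aggregate by visiting every record, even though one field is needed."""
    return sum(row[field] for row in rows)


def sum_column_based(columns, field):
    """Aggregate by reading only the column that holds the field."""
    return sum(columns[field])


# The conversion step that row-based data needs before column-based processing.
columns = to_columns(rows)

row_total = sum_row_based(rows, "bytes")
column_total = sum_column_based(columns, "bytes")
```

Both functions return the same total, but the column-based version never touches the `user` values, while the row-based version works on the data as stored and needs no conversion pass.
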
>
> It is worth noting that no open source project has successfully replicated
> the capabilities of Dremel, nor have any taken on the broader goals of
> flexibility (eg, pluggable query languages, data formats, data sources and
> execution engine operators/connectors) that are part of Drill.
>
> Initial Goals
> =============
> The initial goals for this project are to specify the detailed requirements
> and architecture, and then develop the initial implementation including the
> execution engine and DrQL.
> Like Apache Hadoop, which was built to support multiple storage systems
> (through the FileSystem API) and file formats (through the
> InputFormat/OutputFormat APIs), Drill will be built to support multiple
> query languages, data formats and data sources. The initial implementation
> of Drill will support DrQL and a column-based format similar to Dremel's.
>
> Current Status
> ==============
> Significant work has been completed to identify the initial requirements
> and define the overall system architecture. The next step is to implement
> the four components described in the Rationale section, and we intend to do
> that development as an Apache project.
>
> Meritocracy
> ===========
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those that contribute. Also, Drill has an
> extensible/pluggable architecture that encourages developers to contribute
> various extensions, such as query languages, data formats, data sources and
> execution engine operators and connectors. While some companies will surely
> develop commercial extensions, we also anticipate that some companies and
> individuals will want to contribute such extensions back to the project,
> and we look forward to fostering a rich ecosystem of extensions.
>
> Community
> =========
> The need for an open source system for interactive analysis of large
> datasets is tremendous, so there is potential for a very large
> community. We believe that Drill's extensible architecture will further
> encourage community participation. Also, related Apache projects (eg,
> Hadoop) have very large and active communities, and we expect that over
> time Drill will also attract a large community.
>
> Core Developers
> ===============
> The developers on the initial committers list include experienced
> distributed systems engineers:
> * Tomer Shiran has experience developing distributed execution engines. He
> developed Parallel DataSeries, a data-parallel version of the open source
> DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also the
> author of Applying Idealized Lower-bound Runtime Models to Understand
> Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked
> as a software developer and researcher at IBM Research, Microsoft and HP
> Labs, and is now at MapR Technologies. He has been active in the Hadoop
> community since 2009.
> * Jason Frantz was at Clustrix, where he designed and developed the first
> scale-out SQL database based on MySQL. Jason developed the distributed
> query optimizer that powered Clustrix. He is now a software engineer and
> architect at MapR Technologies.
> * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and
> has a history of over 30 years of contributions to open source. He is now
> at MapR Technologies. Ted has been very active in the Hadoop community
> since the project's early days.
> * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google
> he worked on Google's scalable search infrastructure. MC Srivas has been
> active in the Hadoop community since 2009.
> * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
> Concurrent, he developed Cascading, an Apache-licensed open source
> application framework enabling Java developers to quickly and easily
> develop robust Data Analytics and Data Management applications on Apache
> Hadoop. Chris has been involved in the Hadoop community since the project's
> early days.
> * Keys Botzum was at IBM, where he worked on security and distributed
> systems, and is currently at MapR Technologies.
> * Gera Shegalov was at Oracle, where he worked on networking, storage and
> database kernels, and is currently at MapR Technologies.
> * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed
> Spire, a real-time operational database for Hadoop. He is also a committer
> and PMC member for Apache HBase, and has a long history of contributions to
> open source. Ryan has been involved in the Hadoop community since the
> project's early days.
>
> We realize that additional employer diversity is needed, and we will work
> aggressively to recruit developers from additional companies.
>
> Alignment
> =========
> The initial committers strongly believe that a system for interactive
> analysis of large-scale datasets will gain broader adoption as an open
> source, community driven project, where the community can contribute not
> only to the core components, but also to a growing collection of query
> languages and optimizers, data formats, data sources, and execution engine
> operators and connectors. Drill will integrate closely with Apache Hadoop.
> First, the data will live in Hadoop. That is, Drill will support Hadoop
> FileSystem implementations and HBase. Second, Hadoop-related data formats
> will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools
> will be provided to produce column-based formats. Fourth, Drill tables can
> be registered in HCatalog. Finally, Hive is being considered as the basis
> of the DrQL implementation.
>
> Known Risks
> ===========
>
> Orphaned Products
> =================
> The contributors are leading vendors in this space, with significant open
> source experience, so the risk of being orphaned is relatively low. The
> project could be at risk if vendors decided to change their strategies in
> the market. In such an event, the current committers plan to continue
> working on the project on their own time, though the progress will likely
> be slower. We plan to mitigate this risk by recruiting additional
> committers.
>
> Inexperience with Open Source
> =============================
> The initial committers include veteran Apache members (committers and PMC
> members) and other developers who have varying degrees of experience with
> open source projects. All have been involved with source code that has been
> released under an open source license, and several also have experience
> developing code with an open source development process.
>
> Homogenous Developers
> =====================
> The initial committers are employed by a number of companies, including
> MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> recruiting additional committers from other companies.
>
> Reliance on Salaried Developers
> ===============================
> It is expected that Drill development will occur on both salaried time and
> on volunteer time, after hours. The majority of initial committers are paid
> by their employer to contribute to this project. However, they are all
> passionate about the project, and we are confident that the project will
> continue even if no salaried developers contribute to the project. We are
> committed to recruiting additional committers including non-salaried
> developers.
>
> Relationships with Other Apache Products
> ========================================
> As mentioned in the Alignment section, Drill is closely integrated with
> Hadoop, Avro, Hive and HBase in numerous ways. For example, Drill data
> lives inside a Hadoop environment (Drill operates on in situ data). We look
> forward to collaborating with those communities, as well as other Apache
> communities.
>
> An Excessive Fascination with the Apache Brand
> ==============================================
> Drill solves a real problem that many organizations struggle with, and has
> been proven within Google to be of significant value. The architecture is
> based on academic and industry research. Our rationale for developing Drill
> as an Apache project is detailed in the Rationale section. We believe that
> the Apache brand and community process will help us attract more
> contributors to this project, and help establish ubiquitous APIs. In
> addition, establishing consensus among users and developers of a
> Dremel-like tool is a key requirement for success of the project.
>
> Documentation
> =============
> Drill is inspired by Google's Dremel. Google has published a paper
> highlighting Dremel's innovative nested column-based data format and
> execution engine: http://research.google.com/pubs/pub36632.html
>
> High-level slides have been published by MapR: TODO
>
> Initial Source
> ==============
> There is no initial source code. All source code will be developed within
> the Apache Incubator.
>
> Cryptography
> ============
> Drill will eventually support encryption on the wire. This is not one of
> the initial goals, and we do not expect Drill to be a controlled export
> item due to the use of encryption.
>
> Required Resources
> ==================
>
> Mailing List
> ============
> * drill-private
> * drill-dev
> * drill-user
>
> Subversion Directory
> ====================
> Git is the preferred source control system: git://git.apache.org/drill
>
> Issue Tracking
> ==============
> JIRA Drill (DRILL)
>
> Initial Committers
> ==================
> * Tomer Shiran (tshiran at maprtech dot com)
> * Ted Dunning (tdunning at apache dot org)
> * Jason Frantz (jfrantz at maprtech dot com)
> * MC Srivas (mcsrivas at maprtech dot com)
> * Chris Wensel (chris at concurrentinc dot com)
> * Keys Botzum (kbotzum at maprtech dot com)
> * Gera Shegalov (gshegalov at maprtech dot com)
> * Ryan Rawson (ryan at drawntoscale dot com)
>
> Affiliations
> ============
> The initial committers are employees of MapR Technologies, Drawn to Scale
> and Concurrent. The nominated mentors are employees of MapR Technologies,
> Lucid Imagination and Nokia.
>
> Sponsors
> ========
>
> Champion
> ========
> Ted Dunning (tdunning at apache dot org)
>
> Nominated Mentors
> =================
> * Ted Dunning (tdunning at apache dot org) - Chief Application Architect at
> MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> * Grant Ingersoll (grant at lucidimagination dot com) - Chief Scientist at
> Lucid Imagination, Committer for Lucene, Mahout and other projects.
> * Isabel Drost (Isabel at apache dot org) - Software Developer at Nokia
> Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>
> Sponsoring Entity
> =================
> Incubator
