From: Chris Douglas
To: general@incubator.apache.org
Date: Wed, 8 Aug 2012 15:25:31 -0700
Subject: Re: [PROPOSAL] Drill for the Apache Incubator

+1
-C

On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning wrote:
> This is a duplicated attempt at sending this message, please ignore the
> previous message if it eventually arrives. There appears to be a hangup
> sending email from my apache email address via gmail.
>
> Abstract
> ========
> Drill is a distributed system for interactive analysis of large-scale
> datasets, inspired by Google’s Dremel (
> http://research.google.com/pubs/pub36632.html).
>
> Proposal
> ========
> Drill is a distributed system for interactive analysis of large-scale
> datasets. Drill is similar to Google’s Dremel, with the additional
> flexibility needed to support a broader range of query languages, data
> formats and data sources. It is designed to efficiently process nested
> data. It is a design goal to scale to 10,000 servers or more and to be able
> to process petabytes of data and trillions of records in seconds.
>
> Background
> ==========
> Many organizations have the need to run data-intensive applications,
> including batch processing, stream processing and interactive analysis. In
> recent years open source systems have emerged to address the need for
> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> Apache S4). In 2010 Google published a paper called “Dremel: Interactive
> Analysis of Web-Scale Datasets,” describing a scalable system used
> internally for interactive analysis of nested data. No open source project
> has successfully replicated the capabilities of Dremel.
>
> Rationale
> =========
> There is a strong need in the market for low-latency interactive analysis
> of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol
> Buffers). This need was identified by Google and addressed internally with
> a system called Dremel.
>
> In recent years open source systems have emerged to address the need for
> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> Apache S4). Apache Hadoop, originally inspired by Google’s internal
> MapReduce system, is used by thousands of organizations processing
> large-scale datasets.
> Apache Hadoop is designed to achieve very high
> throughput, but is not designed to achieve the sub-second latency needed
> for interactive data analysis and exploration. Drill, inspired by Google’s
> internal Dremel system, is intended to address this need.
>
> It is worth noting that, as explained by Google in the original paper,
> Dremel complements MapReduce-based computing. Dremel is not intended as a
> replacement for MapReduce and is often used in conjunction with it to
> analyze outputs of MapReduce pipelines or rapidly prototype larger
> computations. Indeed, Dremel and MapReduce are both used by thousands of
> Google employees.
>
> Like Dremel, Drill supports a nested data model with data encoded in a
> number of formats such as JSON, Avro or Protocol Buffers. In many
> organizations nested data is the standard, so supporting a nested data
> model eliminates the need to normalize the data. With that said, flat data
> formats, such as CSV files, are naturally supported as a special case of
> nested data.
>
> The Drill architecture consists of four key components/layers:
> * Query languages: This layer is responsible for parsing the user’s query
> and constructing an execution plan. The initial goal is to support the
> SQL-like language used by Dremel and Google BigQuery (
> https://developers.google.com/bigquery/docs/query-reference), which we call
> DrQL. However, Drill is designed to support other languages and programming
> models, such as the Mongo Query Language (
> http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
> http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
> * Low-latency distributed execution engine: This layer is responsible for
> executing the physical plan. It provides the scalability and fault
> tolerance needed to efficiently query petabytes of data on 10,000 servers.
> Drill’s execution engine is based on research in distributed execution
> engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
> storage, and can be extended with additional operators and connectors.
> * Nested data formats: This layer is responsible for supporting various
> data formats. The initial goal is to support the column-based format used
> by Dremel. Drill is designed to support schema-based formats such as
> Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
> formats such as JSON, BSON or YAML. In addition, it is designed to support
> column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
> row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
> particular distinction with Drill is that the execution engine is flexible
> enough to support column-based processing as well as row-based processing.
> This is important because column-based processing can be much more
> efficient when the data is stored in a column-based format, but many large
> data assets are stored in a row-based format that would require conversion
> before use.
> * Scalable data sources: This layer is responsible for supporting various
> data sources. The initial focus is to leverage Hadoop as a data source.
>
> It is worth noting that no open source project has successfully replicated
> the capabilities of Dremel, nor have any taken on the broader goals of
> flexibility (e.g., pluggable query languages, data formats, data sources and
> execution engine operators/connectors) that are part of Drill.
>
> Initial Goals
> =============
> The initial goals for this project are to specify the detailed requirements
> and architecture, and then develop the initial implementation including the
> execution engine and DrQL.
> Like Apache Hadoop, which was built to support multiple storage systems
> (through the FileSystem API) and file formats (through the
> InputFormat/OutputFormat APIs), Drill will be built to support multiple
> query languages, data formats and data sources. The initial implementation
> of Drill will support DrQL and a column-based format similar to Dremel.
>
> Current Status
> ==============
> Significant work has been completed to identify the initial requirements
> and define the overall system architecture. The next step is to implement
> the four components described in the Rationale section, and we intend to do
> that development as an Apache project.
>
> Meritocracy
> ===========
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those that contribute. Also, Drill has an
> extensible/pluggable architecture that encourages developers to contribute
> various extensions, such as query languages, data formats, data sources and
> execution engine operators and connectors. While some companies will surely
> develop commercial extensions, we also anticipate that some companies and
> individuals will want to contribute such extensions back to the project,
> and we look forward to fostering a rich ecosystem of extensions.
>
> Community
> =========
> The need for a system for interactive analysis of large datasets in
> open source is tremendous, so there is a potential for a very large
> community. We believe that Drill’s extensible architecture will further
> encourage community participation.
> Also, related Apache projects (e.g.,
> Hadoop) have very large and active communities, and we expect that over
> time Drill will also attract a large community.
>
> Core Developers
> ===============
> The developers on the initial committers list include experienced
> distributed systems engineers:
> * Tomer Shiran has experience developing distributed execution engines. He
> developed Parallel DataSeries, a data-parallel version of the open source
> DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also the
> author of Applying Idealized Lower-bound Runtime Models to Understand
> Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked
> as a software developer and researcher at IBM Research, Microsoft and HP
> Labs, and is now at MapR Technologies. He has been active in the Hadoop
> community since 2009.
> * Jason Frantz was at Clustrix, where he designed and developed the first
> scale-out SQL database based on MySQL. Jason developed the distributed
> query optimizer that powered Clustrix. He is now a software engineer and
> architect at MapR Technologies.
> * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and
> has a history of over 30 years of contributions to open source. He is now
> at MapR Technologies. Ted has been very active in the Hadoop community
> since the project’s early days.
> * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google
> he worked on Google’s scalable search infrastructure. MC Srivas has been
> active in the Hadoop community since 2009.
> * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
> Concurrent, he developed Cascading, an Apache-licensed open source
> application framework enabling Java developers to quickly and easily
> develop robust Data Analytics and Data Management applications on Apache
> Hadoop. Chris has been involved in the Hadoop community since the project's
> early days.
> * Keys Botzum was at IBM, where he worked on security and distributed
> systems, and is currently at MapR Technologies.
> * Gera Shegalov was at Oracle, where he worked on networking, storage and
> database kernels, and is currently at MapR Technologies.
> * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed
> Spire, a real-time operational database for Hadoop. He is also a committer
> and PMC member for Apache HBase, and has a long history of contributions to
> open source. Ryan has been involved in the Hadoop community since the
> project's early days.
>
> We realize that additional employer diversity is needed, and we will work
> aggressively to recruit developers from additional companies.
>
> Alignment
> =========
> The initial committers strongly believe that a system for interactive
> analysis of large-scale datasets will gain broader adoption as an open
> source, community driven project, where the community can contribute not
> only to the core components, but also to a growing collection of query
> languages and optimizers, data formats, data sources, and execution engine
> operators and connectors. Drill will integrate closely with Apache Hadoop.
> First, the data will live in Hadoop. That is, Drill will support Hadoop
> FileSystem implementations and HBase. Second, Hadoop-related data formats
> will be supported (e.g., Apache Avro, RCFile). Third, MapReduce-based tools
> will be provided to produce column-based formats. Fourth, Drill tables can
> be registered in HCatalog. Finally, Hive is being considered as the basis
> of the DrQL implementation.
>
> Known Risks
> ===========
>
> Orphaned Products
> =================
> The contributors are leading vendors in this space, with significant open
> source experience, so the risk of being orphaned is relatively low.
> The project could be at risk if vendors decided to change their strategies in
> the market. In such an event, the current committers plan to continue
> working on the project on their own time, though the progress will likely
> be slower. We plan to mitigate this risk by recruiting additional
> committers.
>
> Inexperience with Open Source
> =============================
> The initial committers include veteran Apache members (committers and PMC
> members) and other developers who have varying degrees of experience with
> open source projects. All have been involved with source code that has been
> released under an open source license, and several also have experience
> developing code with an open source development process.
>
> Homogenous Developers
> =====================
> The initial committers are employed by a number of companies, including
> MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> recruiting additional committers from other companies.
>
> Reliance on Salaried Developers
> ===============================
> It is expected that Drill development will occur on both salaried time and
> on volunteer time, after hours. The majority of initial committers are paid
> by their employer to contribute to this project. However, they are all
> passionate about the project, and we are confident that the project will
> continue even if no salaried developers contribute to the project. We are
> committed to recruiting additional committers including non-salaried
> developers.
>
> Relationships with Other Apache Products
> ========================================
> As mentioned in the Alignment section, Drill is closely integrated with
> Hadoop, Avro, Hive and HBase in numerous ways.
> For example, Drill data
> lives inside a Hadoop environment (Drill operates on in situ data). We look
> forward to collaborating with those communities, as well as other Apache
> communities.
>
> An Excessive Fascination with the Apache Brand
> ==============================================
> Drill solves a real problem that many organizations struggle with, and has
> been proven within Google to be of significant value. The architecture is
> based on academic and industry research. Our rationale for developing Drill
> as an Apache project is detailed in the Rationale section. We believe that
> the Apache brand and community process will help us attract more
> contributors to this project, and help establish ubiquitous APIs. In
> addition, establishing consensus among users and developers of a
> Dremel-like tool is a key requirement for success of the project.
>
> Documentation
> =============
> Drill is inspired by Google’s Dremel. Google has published a paper
> highlighting Dremel’s innovative nested column-based data format and
> execution engine: http://research.google.com/pubs/pub36632.html
>
> High-level slides have been published by MapR: TODO
>
> Initial Source
> ==============
> There is no initial source code. All source code will be developed within
> the Apache Incubator.
>
> Cryptography
> ============
> Drill will eventually support encryption on the wire. This is not one of
> the initial goals, and we do not expect Drill to be a controlled export
> item due to the use of encryption.
>
> Required Resources
> ==================
>
> Mailing List
> ============
> * drill-private
> * drill-dev
> * drill-user
>
> Subversion Directory
> ====================
> Git is the preferred source control system: git://git.apache.org/drill
>
> Issue Tracking
> ==============
> JIRA Drill (DRILL)
>
> Initial Committers
> ==================
> * Tomer Shiran (tshiran at maprtech dot com)
> * Ted Dunning (tdunning at apache dot org)
> * Jason Frantz (jfrantz at maprtech dot com)
> * MC Srivas (mcsrivas at maprtech dot com)
> * Chris Wensel (chris at concurrentinc dot com)
> * Keys Botzum (kbotzum at maprtech dot com)
> * Gera Shegalov (gshegalov at maprtech dot com)
> * Ryan Rawson (ryan at drawntoscale dot com)
>
> Affiliations
> ============
> The initial committers are employees of MapR Technologies, Drawn to Scale
> and Concurrent. The nominated mentors are employees of MapR Technologies,
> Lucid Imagination and Nokia.
>
> Sponsors
> ========
>
> Champion
> ========
> Ted Dunning (tdunning at apache dot org)
>
> Nominated Mentors
> =================
> * Ted Dunning (tdunning at apache dot org) - Chief Application Architect at
> MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> * Grant Ingersoll (grant at lucidimagination dot com) - Chief Scientist at
> Lucid Imagination, Committer for Lucene, Mahout and other projects.
> * Isabel Drost (Isabel at apache dot org) - Software Developer at Nokia
> Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>
> Sponsoring Entity
> =================
> Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org