From: Jakob Homan
Date: Wed, 8 Aug 2012 21:14:10 -0700
Subject: Re: [PROPOSAL] Drill for the Apache Incubator
To: general@incubator.apache.org

So, no response to my request above about the design docs and not-TO-DOne
MapR presentation?

On Wed, Aug 8, 2012 at 3:25 PM, Chris Douglas wrote:
> +1 -C
>
> On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning wrote:
>> This is a duplicated attempt at sending this message, please ignore the
>> previous message if it eventually arrives. There appears to be a hangup
>> sending email from my apache email address via gmail.
>>
>> Abstract
>> ========
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets, inspired by Google’s Dremel (
>> http://research.google.com/pubs/pub36632.html).
>>
>> Proposal
>> ========
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets. Drill is similar to Google’s Dremel, with the additional
>> flexibility needed to support a broader range of query languages, data
>> formats and data sources. It is designed to efficiently process nested
>> data. It is a design goal to scale to 10,000 servers or more and to be able
>> to process petabytes of data and trillions of records in seconds.
>>
>> Background
>> ==========
>> Many organizations have the need to run data-intensive applications,
>> including batch processing, stream processing and interactive analysis. In
>> recent years open source systems have emerged to address the need for
>> scalable batch processing (Apache Hadoop) and stream processing (Storm,
>> Apache S4).
>> In 2010 Google published a paper called “Dremel: Interactive
>> Analysis of Web-Scale Datasets,” describing a scalable system used
>> internally for interactive analysis of nested data. No open source project
>> has successfully replicated the capabilities of Dremel.
>>
>> Rationale
>> =========
>> There is a strong need in the market for low-latency interactive analysis
>> of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
>> Buffers). This need was identified by Google and addressed internally with
>> a system called Dremel.
>>
>> In recent years open source systems have emerged to address the need for
>> scalable batch processing (Apache Hadoop) and stream processing (Storm,
>> Apache S4). Apache Hadoop, originally inspired by Google’s internal
>> MapReduce system, is used by thousands of organizations processing
>> large-scale datasets. Apache Hadoop is designed to achieve very high
>> throughput, but is not designed to achieve the sub-second latency needed
>> for interactive data analysis and exploration. Drill, inspired by Google’s
>> internal Dremel system, is intended to address this need.
>>
>> It is worth noting that, as explained by Google in the original paper,
>> Dremel complements MapReduce-based computing. Dremel is not intended as a
>> replacement for MapReduce and is often used in conjunction with it to
>> analyze outputs of MapReduce pipelines or rapidly prototype larger
>> computations. Indeed, Dremel and MapReduce are both used by thousands of
>> Google employees.
>>
>> Like Dremel, Drill supports a nested data model with data encoded in a
>> number of formats such as JSON, Avro or Protocol Buffers. In many
>> organizations nested data is the standard, so supporting a nested data
>> model eliminates the need to normalize the data. With that said, flat data
>> formats, such as CSV files, are naturally supported as a special case of
>> nested data.
>>
>> The Drill architecture consists of four key components/layers:
>> * Query languages: This layer is responsible for parsing the user’s query
>> and constructing an execution plan. The initial goal is to support the
>> SQL-like language used by Dremel and Google BigQuery (
>> https://developers.google.com/bigquery/docs/query-reference), which we call
>> DrQL. However, Drill is designed to support other languages and programming
>> models, such as the Mongo Query Language (
>> http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
>> http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
>> * Low-latency distributed execution engine: This layer is responsible for
>> executing the physical plan. It provides the scalability and fault
>> tolerance needed to efficiently query petabytes of data on 10,000 servers.
>> Drill’s execution engine is based on research in distributed execution
>> engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
>> storage, and can be extended with additional operators and connectors.
>> * Nested data formats: This layer is responsible for supporting various
>> data formats. The initial goal is to support the column-based format used
>> by Dremel. Drill is designed to support schema-based formats such as
>> Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
>> formats such as JSON, BSON or YAML. In addition, it is designed to support
>> column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
>> row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
>> particular distinction with Drill is that the execution engine is flexible
>> enough to support column-based processing as well as row-based processing.
>> This is important because column-based processing can be much more
>> efficient when the data is stored in a column-based format, but many large
>> data assets are stored in a row-based format that would require conversion
>> before use.
>> * Scalable data sources: This layer is responsible for supporting various
>> data sources. The initial focus is to leverage Hadoop as a data source.
>>
>> It is worth noting that no open source project has successfully replicated
>> the capabilities of Dremel, nor have any taken on the broader goals of
>> flexibility (eg, pluggable query languages, data formats, data sources and
>> execution engine operators/connectors) that are part of Drill.
>>
>> Initial Goals
>> =============
>> The initial goals for this project are to specify the detailed requirements
>> and architecture, and then develop the initial implementation including the
>> execution engine and DrQL.
>> Like Apache Hadoop, which was built to support multiple storage systems
>> (through the FileSystem API) and file formats (through the
>> InputFormat/OutputFormat APIs), Drill will be built to support multiple
>> query languages, data formats and data sources. The initial implementation
>> of Drill will support DrQL and a column-based format similar to Dremel’s.
>>
>> Current Status
>> ==============
>> Significant work has been completed to identify the initial requirements
>> and define the overall system architecture. The next step is to implement
>> the four components described in the Rationale section, and we intend to do
>> that development as an Apache project.
>>
>> Meritocracy
>> ===========
>> We plan to invest in supporting a meritocracy. We will discuss the
>> requirements in an open forum. Several companies have already expressed
>> interest in this project, and we intend to invite additional developers to
>> participate.
>> We will encourage and monitor community participation so that
>> privileges can be extended to those who contribute. Also, Drill has an
>> extensible/pluggable architecture that encourages developers to contribute
>> various extensions, such as query languages, data formats, data sources and
>> execution engine operators and connectors. While some companies will surely
>> develop commercial extensions, we also anticipate that some companies and
>> individuals will want to contribute such extensions back to the project,
>> and we look forward to fostering a rich ecosystem of extensions.
>>
>> Community
>> =========
>> The need for a system for interactive analysis of large datasets in the
>> open source world is tremendous, so there is a potential for a very large
>> community. We believe that Drill’s extensible architecture will further
>> encourage community participation. Also, related Apache projects (eg,
>> Hadoop) have very large and active communities, and we expect that over
>> time Drill will also attract a large community.
>>
>> Core Developers
>> ===============
>> The developers on the initial committers list include experienced
>> distributed systems engineers:
>> * Tomer Shiran has experience developing distributed execution engines. He
>> developed Parallel DataSeries, a data-parallel version of the open source
>> DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also the
>> author of Applying Idealized Lower-bound Runtime Models to Understand
>> Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked
>> as a software developer and researcher at IBM Research, Microsoft and HP
>> Labs, and is now at MapR Technologies. He has been active in the Hadoop
>> community since 2009.
>> * Jason Frantz was at Clustrix, where he designed and developed the first
>> scale-out SQL database based on MySQL.
>> Jason developed the distributed
>> query optimizer that powered Clustrix. He is now a software engineer and
>> architect at MapR Technologies.
>> * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and
>> has a history of over 30 years of contributions to open source. He is now
>> at MapR Technologies. Ted has been very active in the Hadoop community
>> since the project’s early days.
>> * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google
>> he worked on Google’s scalable search infrastructure. MC Srivas has been
>> active in the Hadoop community since 2009.
>> * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
>> Concurrent, he developed Cascading, an Apache-licensed open source
>> application framework enabling Java developers to quickly and easily
>> develop robust Data Analytics and Data Management applications on Apache
>> Hadoop. Chris has been involved in the Hadoop community since the project's
>> early days.
>> * Keys Botzum was at IBM, where he worked on security and distributed
>> systems, and is currently at MapR Technologies.
>> * Gera Shegalov was at Oracle, where he worked on networking, storage and
>> database kernels, and is currently at MapR Technologies.
>> * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed
>> Spire, a real-time operational database for Hadoop. He is also a committer
>> and PMC member for Apache HBase, and has a long history of contributions to
>> open source. Ryan has been involved in the Hadoop community since the
>> project's early days.
>>
>> We realize that additional employer diversity is needed, and we will work
>> aggressively to recruit developers from additional companies.
>>
>> Alignment
>> =========
>> The initial committers strongly believe that a system for interactive
>> analysis of large-scale datasets will gain broader adoption as an open
>> source, community-driven project, where the community can contribute not
>> only to the core components, but also to a growing collection of query
>> languages and optimizers, data formats, data sources, and execution engine
>> operators and connectors. Drill will integrate closely with Apache Hadoop.
>> First, the data will live in Hadoop. That is, Drill will support Hadoop
>> FileSystem implementations and HBase. Second, Hadoop-related data formats
>> will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools
>> will be provided to produce column-based formats. Fourth, Drill tables can
>> be registered in HCatalog. Finally, Hive is being considered as the basis
>> of the DrQL implementation.
>>
>> Known Risks
>> ===========
>>
>> Orphaned Products
>> =================
>> The contributors are leading vendors in this space, with significant open
>> source experience, so the risk of being orphaned is relatively low. The
>> project could be at risk if vendors decided to change their strategies in
>> the market. In such an event, the current committers plan to continue
>> working on the project on their own time, though the progress will likely
>> be slower. We plan to mitigate this risk by recruiting additional
>> committers.
>>
>> Inexperience with Open Source
>> =============================
>> The initial committers include veteran Apache members (committers and PMC
>> members) and other developers who have varying degrees of experience with
>> open source projects.
>> All have been involved with source code that has been
>> released under an open source license, and several also have experience
>> developing code with an open source development process.
>>
>> Homogenous Developers
>> =====================
>> The initial committers are employed by a number of companies, including
>> MapR Technologies, Concurrent and Drawn to Scale. We are committed to
>> recruiting additional committers from other companies.
>>
>> Reliance on Salaried Developers
>> ===============================
>> It is expected that Drill development will occur on both salaried time and
>> on volunteer time, after hours. The majority of initial committers are paid
>> by their employer to contribute to this project. However, they are all
>> passionate about the project, and we are confident that the project will
>> continue even if no salaried developers contribute to the project. We are
>> committed to recruiting additional committers including non-salaried
>> developers.
>>
>> Relationships with Other Apache Products
>> ========================================
>> As mentioned in the Alignment section, Drill is closely integrated with
>> Hadoop, Avro, Hive and HBase in numerous ways. For example, Drill data
>> lives inside a Hadoop environment (Drill operates on in situ data). We look
>> forward to collaborating with those communities, as well as other Apache
>> communities.
>>
>> An Excessive Fascination with the Apache Brand
>> ==============================================
>> Drill solves a real problem that many organizations struggle with, and has
>> been proven within Google to be of significant value.
>> The architecture is
>> based on academic and industry research. Our rationale for developing Drill
>> as an Apache project is detailed in the Rationale section. We believe that
>> the Apache brand and community process will help us attract more
>> contributors to this project, and help establish ubiquitous APIs. In
>> addition, establishing consensus among users and developers of a
>> Dremel-like tool is a key requirement for success of the project.
>>
>> Documentation
>> =============
>> Drill is inspired by Google’s Dremel. Google has published a paper
>> highlighting Dremel’s innovative nested column-based data format and
>> execution engine: http://research.google.com/pubs/pub36632.html
>>
>> High-level slides have been published by MapR: TODO
>>
>> Initial Source
>> ==============
>> There is no initial source code. All source code will be developed within
>> the Apache Incubator.
>>
>> Cryptography
>> ============
>> Drill will eventually support encryption on the wire. This is not one of
>> the initial goals, and we do not expect Drill to be a controlled export
>> item due to the use of encryption.
>>
>> Required Resources
>> ==================
>>
>> Mailing List
>> ============
>> * drill-private
>> * drill-dev
>> * drill-user
>>
>> Subversion Directory
>> ====================
>> Git is the preferred source control system: git://git.apache.org/drill
>>
>> Issue Tracking
>> ==============
>> JIRA Drill (DRILL)
>>
>> Initial Committers
>> ==================
>> * Tomer Shiran (tshiran at maprtech dot com)
>> * Ted Dunning (tdunning at apache dot org)
>> * Jason Frantz (jfrantz at maprtech dot com)
>> * MC Srivas (mcsrivas at maprtech dot com)
>> * Chris Wensel (chris at concurrentinc dot com)
>> * Keys Botzum (kbotzum at maprtech dot com)
>> * Gera Shegalov (gshegalov at maprtech dot com)
>> * Ryan Rawson (ryan at drawntoscale dot com)
>>
>> Affiliations
>> ============
>> The initial committers are employees of MapR Technologies, Drawn to Scale
>> and Concurrent. The nominated mentors are employees of MapR Technologies,
>> Lucid Imagination and Nokia.
>>
>> Sponsors
>> ========
>>
>> Champion
>> ========
>> Ted Dunning (tdunning at apache dot org)
>>
>> Nominated Mentors
>> =================
>> * Ted Dunning (tdunning at apache dot org) – Chief Application Architect at
>> MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
>> * Grant Ingersoll (grant at lucidimagination dot com) – Chief Scientist at
>> Lucid Imagination, Committer for Lucene, Mahout and other projects.
>> * Isabel Drost (Isabel at apache dot org) – Software Developer at Nokia
>> Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>>
>> Sponsoring Entity
>> =================
>> Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org