From: Alex Karasulu
To: general@incubator.apache.org
Date: Wed, 8 Aug 2012 09:33:03 +0300
Subject: Re: [VOTE] Accept Drill into the Apache Incubator

+1 (binding)

On Wed, Aug 8, 2012 at 8:33 AM, Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov> wrote:

> +1 (binding). Good luck and sounds cool!
>
> Cheers,
> Chris
>
> On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:
>
> > I would like to call a vote for accepting Drill for incubation in the
> > Apache Incubator. The full proposal is available below. Discussion
> > over the last few days has been quite positive.
> >
> > Please cast your vote:
> >
> > [ ] +1, bring Drill into Incubator
> > [ ] +0, I don't care either way,
> > [ ] -1, do not bring Drill into Incubator, because...
> >
> > This vote will be open for 72 hours and only votes from the Incubator
> > PMC are binding. The start of the vote is just before 3AM UTC on 8
> > August so the closing time will be 3AM UTC on 11 August.
> >
> > Thank you for your consideration!
> >
> > Ted
> >
> > http://wiki.apache.org/incubator/DrillProposal
> >
> > = Drill =
> >
> > == Abstract ==
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets, inspired by
> > [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> >
> > == Proposal ==
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets. Drill is similar to Google's Dremel, with the additional
> > flexibility needed to support a broader range of query languages, data
> > formats and data sources. It is designed to efficiently process nested
> > data. It is a design goal to scale to 10,000 servers or more and to be
> > able to process petabytes of data and trillions of records in seconds.
> >
> > == Background ==
> > Many organizations have the need to run data-intensive applications,
> > including batch processing, stream processing and interactive
> > analysis. In recent years open source systems have emerged to address
> > the need for scalable batch processing (Apache Hadoop) and stream
> > processing (Storm, Apache S4). In 2010 Google published a paper called
> > "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> > scalable system used internally for interactive analysis of nested
> > data. No open source project has successfully replicated the
> > capabilities of Dremel.
> >
> > == Rationale ==
> > There is a strong need in the market for low-latency interactive
> > analysis of large-scale datasets, including nested data (eg, JSON,
> > Avro, Protocol Buffers). This need was identified by Google and
> > addressed internally with a system called Dremel.
> >
> > In recent years open source systems have emerged to address the need
> > for scalable batch processing (Apache Hadoop) and stream processing
> > (Storm, Apache S4).
> > Apache Hadoop, originally inspired by Google's
> > internal MapReduce system, is used by thousands of organizations
> > processing large-scale datasets. Apache Hadoop is designed to achieve
> > very high throughput, but is not designed to achieve the sub-second
> > latency needed for interactive data analysis and exploration. Drill,
> > inspired by Google's internal Dremel system, is intended to address
> > this need.
> >
> > It is worth noting that, as explained by Google in the original paper,
> > Dremel complements MapReduce-based computing. Dremel is not intended
> > as a replacement for MapReduce and is often used in conjunction with
> > it to analyze outputs of MapReduce pipelines or rapidly prototype
> > larger computations. Indeed, Dremel and MapReduce are both used by
> > thousands of Google employees.
> >
> > Like Dremel, Drill supports a nested data model with data encoded in a
> > number of formats such as JSON, Avro or Protocol Buffers. In many
> > organizations nested data is the standard, so supporting a nested data
> > model eliminates the need to normalize the data. With that said, flat
> > data formats, such as CSV files, are naturally supported as a special
> > case of nested data.
> >
> > The Drill architecture consists of four key components/layers:
> > * Query languages: This layer is responsible for parsing the user's
> > query and constructing an execution plan. The initial goal is to
> > support the SQL-like language used by Dremel and
> > [[https://developers.google.com/bigquery/docs/query-reference|Google
> > BigQuery]], which we call DrQL. However, Drill is designed to support
> > other languages and programming models, such as the
> > [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
> > Language]], [[http://www.cascading.org/|Cascading]] or
> > [[https://github.com/tdunning/Plume|Plume]].
> > * Low-latency distributed execution engine: This layer is responsible
> > for executing the physical plan.
> > It provides the scalability and fault
> > tolerance needed to efficiently query petabytes of data on 10,000
> > servers. Drill's execution engine is based on research in distributed
> > execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> > columnar storage, and can be extended with additional operators and
> > connectors.
> > * Nested data formats: This layer is responsible for supporting
> > various data formats. The initial goal is to support the column-based
> > format used by Dremel. Drill is designed to support schema-based
> > formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
> > and schema-less formats such as JSON, BSON or YAML. In addition, it is
> > designed to support column-based formats such as Dremel,
> > AVRO-806/Trevni and RCFile, and row-based formats such as Protocol
> > Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill
> > is that the execution engine is flexible enough to support
> > column-based processing as well as row-based processing. This is
> > important because column-based processing can be much more efficient
> > when the data is stored in a column-based format, but many large data
> > assets are stored in a row-based format that would require conversion
> > before use.
> > * Scalable data sources: This layer is responsible for supporting
> > various data sources. The initial focus is to leverage Hadoop as a
> > data source.
> >
> > It is worth noting that no open source project has successfully
> > replicated the capabilities of Dremel, nor have any taken on the
> > broader goals of flexibility (eg, pluggable query languages, data
> > formats, data sources and execution engine operators/connectors) that
> > are part of Drill.
> >
> > == Initial Goals ==
> > The initial goals for this project are to specify the detailed
> > requirements and architecture, and then develop the initial
> > implementation including the execution engine and DrQL.
> > Like Apache Hadoop, which was built to support multiple storage
> > systems (through the FileSystem API) and file formats (through the
> > InputFormat/OutputFormat APIs), Drill will be built to support
> > multiple query languages, data formats and data sources. The initial
> > implementation of Drill will support the DrQL and a column-based
> > format similar to Dremel.
> >
> > == Current Status ==
> > Significant work has been completed to identify the initial
> > requirements and define the overall system architecture. The next step
> > is to implement the four components described in the Rationale
> > section, and we intend to do that development as an Apache project.
> >
> > === Meritocracy ===
> > We plan to invest in supporting a meritocracy. We will discuss the
> > requirements in an open forum. Several companies have already
> > expressed interest in this project, and we intend to invite additional
> > developers to participate. We will encourage and monitor community
> > participation so that privileges can be extended to those that
> > contribute. Also, Drill has an extensible/pluggable architecture that
> > encourages developers to contribute various extensions, such as query
> > languages, data formats, data sources and execution engine operators
> > and connectors. While some companies will surely develop commercial
> > extensions, we also anticipate that some companies and individuals
> > will want to contribute such extensions back to the project, and we
> > look forward to fostering a rich ecosystem of extensions.
> >
> > === Community ===
> > The need for a system for interactive analysis of large datasets in
> > the open source is tremendous, so there is a potential for a very
> > large community. We believe that Drill's extensible architecture will
> > further encourage community participation.
> > Also, related Apache
> > projects (eg, Hadoop) have very large and active communities, and we
> > expect that over time Drill will also attract a large community.
> >
> > === Core Developers ===
> > The developers on the initial committers list include experienced
> > distributed systems engineers:
> > * Tomer Shiran has experience developing distributed execution
> > engines. He developed Parallel DataSeries, a data-parallel version of
> > the open source [[http://tesla.hpl.hp.com/opensource/|DataSeries]]
> > system. He is also the author of Applying Idealized Lower-bound
> > Runtime Models to Understand Inefficiencies in Data-intensive
> > Computing (SIGMETRICS 2011). Tomer worked as a software developer and
> > researcher at IBM Research, Microsoft and HP Labs, and is now at MapR
> > Technologies. He has been active in the Hadoop community since 2009.
> > * Jason Frantz was at Clustrix, where he designed and developed the
> > first scale-out SQL database based on MySQL. Jason developed the
> > distributed query optimizer that powered Clustrix. He is now a
> > software engineer and architect at MapR Technologies.
> > * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
> > and has a history of over 30 years of contributions to open source. He
> > is now at MapR Technologies. Ted has been very active in the Hadoop
> > community since the project's early days.
> > * MC Srivas is the co-founder and CTO of MapR Technologies. While at
> > Google he worked on Google's scalable search infrastructure. MC Srivas
> > has been active in the Hadoop community since 2009.
> > * Chris Wensel is the founder and CEO of Concurrent. Prior to
> > founding Concurrent, he developed Cascading, an Apache-licensed open
> > source application framework enabling Java developers to quickly and
> > easily develop robust Data Analytics and Data Management applications
> > on Apache Hadoop.
> > Chris has been involved in the Hadoop community
> > since the project's early days.
> > * Keys Botzum was at IBM, where he worked on security and distributed
> > systems, and is currently at MapR Technologies.
> > * Gera Shegalov was at Oracle, where he worked on networking, storage
> > and database kernels, and is currently at MapR Technologies.
> > * Ryan Rawson is the VP Engineering of Drawn to Scale where he
> > developed Spire, a real-time operational database for Hadoop. He is
> > also a committer and PMC member for Apache HBase, and has a long
> > history of contributions to open source. Ryan has been involved in the
> > Hadoop community since the project's early days.
> >
> > We realize that additional employer diversity is needed, and we will
> > work aggressively to recruit developers from additional companies.
> >
> > === Alignment ===
> > The initial committers strongly believe that a system for interactive
> > analysis of large-scale datasets will gain broader adoption as an open
> > source, community driven project, where the community can contribute
> > not only to the core components, but also to a growing collection of
> > query languages and optimizers, data formats, data sources, and
> > execution engine operators and connectors. Drill will integrate
> > closely with Apache Hadoop. First, the data will live in Hadoop. That
> > is, Drill will support Hadoop FileSystem implementations and HBase.
> > Second, Hadoop-related data formats will be supported (eg, Apache
> > Avro, RCFile). Third, MapReduce-based tools will be provided to
> > produce column-based formats. Fourth, Drill tables can be registered
> > in HCatalog. Finally, Hive is being considered as the basis of the
> > DrQL implementation.
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> > The contributors are leading vendors in this space, with significant
> > open source experience, so the risk of being orphaned is relatively
> > low.
> > The project could be at risk if vendors decided to change their
> > strategies in the market. In such an event, the current committers
> > plan to continue working on the project on their own time, though the
> > progress will likely be slower. We plan to mitigate this risk by
> > recruiting additional committers.
> >
> > === Inexperience with Open Source ===
> > The initial committers include veteran Apache members (committers and
> > PMC members) and other developers who have varying degrees of
> > experience with open source projects. All have been involved with
> > source code that has been released under an open source license, and
> > several also have experience developing code with an open source
> > development process.
> >
> > === Homogenous Developers ===
> > The initial committers are employed by a number of companies,
> > including MapR Technologies, Concurrent and Drawn to Scale. We are
> > committed to recruiting additional committers from other companies.
> >
> > === Reliance on Salaried Developers ===
> > It is expected that Drill development will occur on both salaried time
> > and on volunteer time, after hours. The majority of initial committers
> > are paid by their employer to contribute to this project. However,
> > they are all passionate about the project, and we are confident that
> > the project will continue even if no salaried developers contribute to
> > the project. We are committed to recruiting additional committers
> > including non-salaried developers.
> >
> > === Relationships with Other Apache Products ===
> > As mentioned in the Alignment section, Drill is closely integrated
> > with Hadoop, Avro, Hive and HBase in numerous ways. For example,
> > Drill data lives inside a Hadoop environment (Drill operates on in
> > situ data). We look forward to collaborating with those communities,
> > as well as other Apache communities.
> >
> > === An Excessive Fascination with the Apache Brand ===
> > Drill solves a real problem that many organizations struggle with, and
> > has been proven within Google to be of significant value. The
> > architecture is based on academic and industry research. Our rationale
> > for developing Drill as an Apache project is detailed in the Rationale
> > section. We believe that the Apache brand and community process will
> > help us attract more contributors to this project, and help establish
> > ubiquitous APIs. In addition, establishing consensus among users and
> > developers of a Dremel-like tool is a key requirement for success of
> > the project.
> >
> > == Documentation ==
> > Drill is inspired by Google's Dremel. Google has published a
> > [[http://research.google.com/pubs/pub36632.html|paper]] highlighting
> > Dremel's innovative nested column-based data format and execution
> > engine.
> >
> > == Initial Source ==
> > The requirement and design documents are currently stored in MapR
> > Technologies' source code repository. They will be checked in as part
> > of the initial code dump.
> >
> > == Cryptography ==
> > Drill will eventually support encryption on the wire. This is not one
> > of the initial goals, and we do not expect Drill to be a controlled
> > export item due to the use of encryption.
> >
> > == Required Resources ==
> >
> > === Mailing List ===
> > * drill-private
> > * drill-dev
> > * drill-user
> >
> > === Subversion Directory ===
> > Git is the preferred source control system: git://git.apache.org/drill
> >
> > === Issue Tracking ===
> > JIRA Drill (DRILL)
> >
> > == Initial Committers ==
> > * Tomer Shiran
> > * Ted Dunning
> > * Jason Frantz
> > * MC Srivas
> > * Chris Wensel
> > * Keys Botzum
> > * Gera Shegalov
> > * Ryan Rawson
> >
> > == Affiliations ==
> > The initial committers are employees of MapR Technologies, Drawn to
> > Scale and Concurrent. The nominated mentors are employees of MapR
> > Technologies, Lucid Imagination and Nokia.
> >
> > == Sponsors ==
> >
> > === Champion ===
> > Ted Dunning (tdunning at apache dot org)
> >
> > === Nominated Mentors ===
> > * Ted Dunning – Chief Application
> > Architect at MapR Technologies, Committer for Lucene, Mahout and
> > ZooKeeper.
> > * Grant Ingersoll – Chief
> > Scientist at Lucid Imagination, Committer for Lucene, Mahout and other
> > projects.
> > * Isabel Drost – Software Developer at
> > Nokia Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> >
> > === Sponsoring Entity ===
> > Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > For additional commands, e-mail: general-help@incubator.apache.org
> >
> >
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

-- 
Best Regards,
-- Alex