incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Karasulu <akaras...@apache.org>
Subject Re: [VOTE] Accept Drill into the Apache Incubator
Date Wed, 08 Aug 2012 06:33:03 GMT
+1 (binding)

On Wed, Aug 8, 2012 at 8:33 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> +1 (binding). Good luck and sounds cool!
>
> Cheers,
> Chris
>
> On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:
>
> > I would like to call a vote for accepting Drill for incubation in the
> > Apache Incubator. The full proposal is available below.  Discussion
> > over the last few days has been quite positive.
> >
> > Please cast your vote:
> >
> > [ ] +1, bring Drill into Incubator
> > [ ] +0, I don't care either way,
> > [ ] -1, do not bring Drill into Incubator, because...
> >
> > This vote will be open for 72 hours and only votes from the Incubator
> > PMC are binding.  The start of the vote is just before 3AM UTC on 8
> > August so the closing time will be 3AM UTC on 11 August.
> >
> > Thank you for your consideration!
> >
> > Ted
> >
> > http://wiki.apache.org/incubator/DrillProposal
> >
> > = Drill =
> >
> > == Abstract ==
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets, inspired by
> > [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> >
> > == Proposal ==
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets. Drill is similar to Google's Dremel, with the additional
> > flexibility needed to support a broader range of query languages, data
> > formats and data sources. It is designed to efficiently process nested
> > data. It is a design goal to scale to 10,000 servers or more and to be
> > able to process petabyes of data and trillions of records in seconds.
> >
> > == Background ==
> > Many organizations have the need to run data-intensive applications,
> > including batch processing, stream processing and interactive
> > analysis. In recent years open source systems have emerged to address
> > the need for scalable batch processing (Apache Hadoop) and stream
> > processing (Storm, Apache S4). In 2010 Google published a paper called
> > "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> > scalable system used internally for interactive analysis of nested
> > data. No open source project has successfully replicated the
> > capabilities of Dremel.
> >
> > == Rationale ==
> > There is a strong need in the market for low-latency interactive
> > analysis of large-scale datasets, including nested data (eg, JSON,
> > Avro, Protocol Buffers). This need was identified by Google and
> > addressed internally with a system called Dremel.
> >
> > In recent years open source systems have emerged to address the need
> > for scalable batch processing (Apache Hadoop) and stream processing
> > (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> > internal MapReduce system, is used by thousands of organizations
> > processing large-scale datasets. Apache Hadoop is designed to achieve
> > very high throughput, but is not designed to achieve the sub-second
> > latency needed for interactive data analysis and exploration. Drill,
> > inspired by Google's internal Dremel system, is intended to address
> > this need.
> >
> > It is worth noting that, as explained by Google in the original paper,
> > Dremel complements MapReduce-based computing. Dremel is not intended
> > as a replacement for MapReduce and is often used in conjunction with
> > it to analyze outputs of MapReduce pipelines or rapidly prototype
> > larger computations. Indeed, Dremel and MapReduce are both used by
> > thousands of Google employees.
> >
> > Like Dremel, Drill supports a nested data model with data encoded in a
> > number of formats such as JSON, Avro or Protocol Buffers. In many
> > organizations nested data is the standard, so supporting a nested data
> > model eliminates the need to normalize the data. With that said, flat
> > data formats, such as CSV files, are naturally supported as a special
> > case of nested data.
> >
> > The Drill architecture consists of four key components/layers:
> > * Query languages: This layer is responsible for parsing the user's
> > query and constructing an execution plan.  The initial goal is to
> > support the SQL-like language used by Dremel and
> > [[https://developers.google.com/bigquery/docs/query-reference|Google
> > BigQuery]], which we call DrQL. However, Drill is designed to support
> > other languages and programming models, such as the
> > [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
> > Language]], [[http://www.cascading.org/|Cascading]] or
> > [[https://github.com/tdunning/Plume|Plume]].
> > * Low-latency distributed execution engine: This layer is responsible
> > for executing the physical plan. It provides the scalability and fault
> > tolerance needed to efficiently query petabytes of data on 10,000
> > servers. Drill's execution engine is based on research in distributed
> > execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> > columnar storage, and can be extended with additional operators and
> > connectors.
> > * Nested data formats: This layer is responsible for supporting
> > various data formats. The initial goal is to support the column-based
> > format used by Dremel. Drill is designed to support schema-based
> > formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
> > and schema-less formats such as JSON, BSON or YAML. In addition, it is
> > designed to support column-based formats such as Dremel,
> > AVRO-806/Trevni and RCFile, and row-based formats such as Protocol
> > Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill
> > is that the execution engine is flexible enough to support
> > column-based processing as well as row-based processing. This is
> > important because column-based processing can be much more efficient
> > when the data is stored in a column-based format, but many large data
> > assets are stored in a row-based format that would require conversion
> > before use.
> > * Scalable data sources: This layer is responsible for supporting
> > various data sources. The initial focus is to leverage Hadoop as a
> > data source.
> >
> > It is worth noting that no open source project has successfully
> > replicated the capabilities of Dremel, nor have any taken on the
> > broader goals of flexibility (eg, pluggable query languages, data
> > formats, data sources and execution engine operators/connectors) that
> > are part of Drill.
> >
> > == Initial Goals ==
> > The initial goals for this project are to specify the detailed
> > requirements and architecture, and then develop the initial
> > implementation including the execution engine and DrQL.
> > Like Apache Hadoop, which was built to support multiple storage
> > systems (through the FileSystem API) and file formats (through the
> > InputFormat/OutputFormat APIs), Drill will be built to support
> > multiple query languages, data formats and data sources. The initial
> > implementation of Drill will support the DrQL and a column-based
> > format similar to Dremel.
> >
> > == Current Status ==
> > Significant work has been completed to identify the initial
> > requirements and define the overall system architecture. The next step
> > is to implement the four components described in the Rationale
> > section, and we intend to do that development as an Apache project.
> >
> > === Meritocracy ===
> > We plan to invest in supporting a meritocracy. We will discuss the
> > requirements in an open forum. Several companies have already
> > expressed interest in this project, and we intend to invite additional
> > developers to participate. We will encourage and monitor community
> > participation so that privileges can be extended to those that
> > contribute. Also, Drill has an extensible/pluggable architecture that
> > encourages developers to contribute various extensions, such as query
> > languages, data formats, data sources and execution engine operators
> > and connectors. While some companies will surely develop commercial
> > extensions, we also anticipate that some companies and individuals
> > will want to contribute such extensions back to the project, and we
> > look forward to fostering a rich ecosystem of extensions.
> >
> > === Community ===
> > The need for a system for interactive analysis of large datasets in
> > the open source is tremendous, so there is a potential for a very
> > large community. We believe that Drill's extensible architecture will
> > further encourage community participation. Also, related Apache
> > projects (eg, Hadoop) have very large and active communities, and we
> > expect that over time Drill will also attract a large community.
> >
> > === Core Developers ===
> > The developers on the initial committers list include experienced
> > distributed systems engineers:
> > * Tomer Shiran has experience developing distributed execution
> > engines. He developed Parallel DataSeries, a data-parallel version of
> > the open source [[http://tesla.hpl.hp.com/opensource/|DataSeries]]
> > system. He is also the author of Applying Idealized Lower-bound
> > Runtime Models to Understand Inefficiencies in Data-intensive
> > Computing (SIGMETRICS 2011). Tomer worked as a software developer and
> > researcher at IBM Research, Microsoft and HP Labs, and is now at MapR
> > Technologies. He has been active in the Hadoop community since 2009.
> > * Jason Frantz was at Clustrix, where he designed and developed the
> > first scale-out SQL database based on MySQL. Jason developed the
> > distributed query optimizer that powered Clustrix. He is now a
> > software engineer and architect at MapR Technologies.
> > * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
> > and has a history of over 30 years of contributions to open source. He
> > is now at MapR Technologies. Ted has been very active in the Hadoop
> > community since the project's early days.
> > * MC Srivas is the co-founder and CTO of MapR Technologies. While at
> > Google he worked on Google's scalable search infrastructure. MC Srivas
> > has been active in the Hadoop community since 2009.
> > * Chris Wensel is the founder and CEO of Concurrent. Prior to
> > founding Concurrent, he developed Cascading, an Apache-licensed open
> > source application framework enabling Java developers to quickly and
> > easily develop robust Data Analytics and Data Management applications
> > on Apache Hadoop. Chris has been involved in the Hadoop community
> > since the project's early days.
> > * Keys Botzum was at IBM, where he worked on security and distributed
> > systems, and is currently at MapR Technologies.
> > * Gera Shegalov was at Oracle, where he worked on networking, storage
> > and database kernels, and is currently at MapR Technologies.
> > * Ryan Rawson is the VP Engineering of Drawn to Scale where he
> > developed Spire, a real-time operational database for Hadoop. He is
> > also a committer and PMC member for Apache HBase, and has a long
> > history of contributions to open source. Ryan has been involved in the
> > Hadoop community since the project's early days.
> >
> > We realize that additional employer diversity is needed, and we will
> > work aggressively to recruit developers from additional companies.
> >
> > === Alignment ===
> > The initial committers strongly believe that a system for interactive
> > analysis of large-scale datasets will gain broader adoption as an open
> > source, community driven project, where the community can contribute
> > not only to the core components, but also to a growing collection of
> > query languages and optimizers, data formats, data formats, and
> > execution engine operators and connectors. Drill will integrate
> > closely with Apache Hadoop. First, the data will live in Hadoop. That
> > is, Drill will support Hadoop FileSystem implementations and HBase.
> > Second, Hadoop-related data formats will be supported (eg, Apache
> > Avro, RCFile). Third, MapReduce-based tools will be provided to
> > produce column-based formats. Fourth, Drill tables can be registered
> > in HCatalog. Finally, Hive is being considered as the basis of the
> > DrQL implementation.
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> > The contributors are leading vendors in this space, with significant
> > open source experience, so the risk of being orphaned is relatively
> > low. The project could be at risk if vendors decided to change their
> > strategies in the market. In such an event, the current committers
> > plan to continue working on the project on their own time, though the
> > progress will likely be slower. We plan to mitigate this risk by
> > recruiting additional committers.
> >
> > === Inexperience with Open Source ===
> > The initial committers include veteran Apache members (committers and
> > PMC members) and other developers who have varying degrees of
> > experience with open source projects. All have been involved with
> > source code that has been released under an open source license, and
> > several also have experience developing code with an open source
> > development process.
> >
> > === Homogenous Developers ===
> > The initial committers are employed by a number of companies,
> > including MapR Technologies, Concurrent and Drawn to Scale. We are
> > committed to recruiting additional committers from other companies.
> >
> > === Reliance on Salaried Developers ===
> > It is expected that Drill development will occur on both salaried time
> > and on volunteer time, after hours. The majority of initial committers
> > are paid by their employer to contribute to this project. However,
> > they are all passionate about the project, and we are confident that
> > the project will continue even if no salaried developers contribute to
> > the project. We are committed to recruiting additional committers
> > including non-salaried developers.
> >
> > === Relationships with Other Apache Products ===
> > As mentioned in the Alignment section, Drill is closely integrated
> > with Hadoop, Avro, Hive and HBase in a numerous ways. For example,
> > Drill data lives inside a Hadoop environment (Drill operates on in
> > situ data). We look forward to collaborating with those communities,
> > as well as other Apache communities.
> >
> > === An Excessive Fascination with the Apache Brand ===
> > Drill solves a real problem that many organizations struggle with, and
> > has been proven within Google to be of significant value. The
> > architecture is based on academic and industry research. Our rationale
> > for developing Drill as an Apache project is detailed in the Rationale
> > section. We believe that the Apache brand and community process will
> > help us attract more contributors to this project, and help establish
> > ubiquitous APIs. In addition, establishing consensus among users and
> > developers of a Dremel-like tool is a key requirement for success of
> > the project.
> >
> > == Documentation ==
> > Drill is inspired by Google's Dremel. Google has published a
> > [[http://research.google.com/pubs/pub36632.html|paper]] highlighting
> > Dremel's innovative nested column-based data format and execution
> > engine.
> >
> > == Initial Source ==
> > The requirement and design documents are currently stored in MapR
> > Technologies' source code repository. They will be checked in as part
> > of the initial code dump.
> >
> > == Cryptography ==
> > Drill will eventually support encryption on the wire. This is not one
> > of the initial goals, and we do not expect Drill to be a controlled
> > export item due to the use of encryption.
> >
> > == Required Resources ==
> >
> > === Mailing List ===
> > * drill-private
> > * drill-dev
> > * drill-user
> >
> > === Subversion Directory ===
> > Git is the preferred source control system: git://git.apache.org/drill
> >
> > === Issue Tracking ===
> > JIRA Drill (DRILL)
> >
> > == Initial Committers ==
> > * Tomer Shiran <tshiran at maprtech dot com>
> > * Ted Dunning <tdunning at apache dot org>
> > * Jason Frantz <jfrantz at maprtech dot com>
> > * MC Srivas <mcsrivas at maprtech dot com>
> > * Chris Wensel <chris and concurrentinc dot com>
> > * Keys Botzum <kbotzum at maprtech dot com>
> > * Gera Shegalov <gshegalov at maprtech dot com>
> > * Ryan Rawson <ryan at drawntoscale dot com>
> >
> > == Affiliations ==
> > The initial committers are employees of MapR Technologies, Drawn to
> > Scale and Concurrent. The nominated mentors are employees of MapR
> > Technologies, Lucid Imagination and Nokia.
> >
> > == Sponsors ==
> >
> > === Champion ===
> > Ted Dunning (tdunning at apache dot org)
> >
> > === Nominated Mentors ===
> > * Ted Dunning <tdunning at apache dot org> – Chief Application
> > Architect at MapR Technologies, Committer for Lucene, Mahout and
> > ZooKeeper.
> > * Grant Ingersoll <grant at lucidimagination dot com> – Chief
> > Scientist at Lucid Imagination, Committer for Lucene, Mahout and other
> > projects.
> > * Isabel Drost <isabel at apache dot org> – Software Developer at
> > Nokia Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> >
> > === Sponsoring Entity ===
> > Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > For additional commands, e-mail: general-help@incubator.apache.org
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>


-- 
Best Regards,
-- Alex

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message