incubator-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: [PROPOSAL] Drill for the Apache Incubator
Date Thu, 16 Aug 2012 14:18:20 GMT
The mailing list request is in infra's hands.

One of the better sources of information about Dremel is the BigQuery
documentation.  It says that the right side of a join must be smaller than
8 MB and that the only outer join available is a left outer join.

What Drill does is somewhat of a different question.
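
To illustrate why a limit like that is natural, here is a minimal Python
sketch of a left outer join that holds the entire right side in memory as a
hash table. The 8 MB figure is the one from the BigQuery documentation
mentioned above; the function name, the crude size estimate and the sample
rows are assumptions made up for this example, and nothing here is meant to
describe how Dremel, BigQuery or Drill actually implement joins.

```python
MAX_RIGHT_BYTES = 8 * 1024 * 1024  # 8 MB cap on the right side (figure quoted above)


def left_outer_join(left, right, key, max_right_bytes=MAX_RIGHT_BYTES):
    """Join two lists of dicts on `key`, keeping every left row.

    Hypothetical helper: the right side is loaded into an in-memory hash
    table, which is only practical when the right side is small.
    """
    # Crude size estimate (an assumption): total length of the row reprs.
    right_size = sum(len(repr(row)) for row in right)
    if right_size > max_right_bytes:
        raise ValueError("right side of join exceeds size limit")

    # Build phase: index the right rows by join key.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    # Probe phase: stream the left rows; unmatched left rows get None
    # for every right-side column (left outer join semantics).
    right_columns = sorted({c for row in right for c in row if c != key})
    result = []
    for lrow in left:
        matches = index.get(lrow[key])
        if matches:
            for rrow in matches:
                result.append({**lrow, **rrow})
        else:
            result.append({**lrow, **{c: None for c in right_columns}})
    return result


visits = [{"user": "a", "page": "/"}, {"user": "b", "page": "/docs"}]
users = [{"user": "a", "country": "US"}]
joined = left_outer_join(visits, users, "user")
print(joined)
```

Because only the small side is indexed, the large left side can be streamed
once without being shuffled or sorted, which is the usual argument for
keeping the right side tiny.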

On Thu, Aug 16, 2012 at 12:18 AM, Tomer Shiran <tshiran@maprtech.com> wrote:

> Yes, we plan to support joins.
>
> We are in the process of setting up the mailing lists.
>
> On Thu, Aug 16, 2012 at 12:09 AM, karthik tunga <karthik.tunga@gmail.com> wrote:
>
> > The proposal looks great. I was wondering what operations Drill will
> > support. For example, the Dremel paper doesn't talk about joins; will
> > Drill support joins?
> >
> > Sorry if I missed it, but is there a dev mailing list I could subscribe to?
> >
> > Cheers,
> > Karthik
> >
> > On 13 August 2012 23:55, Bernd Fondermann <bernd.fondermann@gmail.com> wrote:
> >
> > > great proposal and a very promising mentor lineup.
> > >
> > > Have fun,
> > >
> > >   Bernd
> > >
> > > On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning <tdunning@apache.org> wrote:
> > > > Abstract
> > > > ========
> > > > Drill is a distributed system for interactive analysis of large-scale
> > > > datasets, inspired by Google’s Dremel (
> > > > http://research.google.com/pubs/pub36632.html).
> > > >
> > > > Proposal
> > > > ========
> > > > Drill is a distributed system for interactive analysis of large-scale
> > > > datasets. Drill is similar to Google’s Dremel, with the additional
> > > > flexibility needed to support a broader range of query languages, data
> > > > formats and data sources. It is designed to efficiently process nested
> > > > data. It is a design goal to scale to 10,000 servers or more and to be
> > > > able to process petabytes of data and trillions of records in seconds.
> > > >
> > > > Background
> > > > ==========
> > > > Many organizations have the need to run data-intensive applications,
> > > > including batch processing, stream processing and interactive analysis.
> > > > In recent years open source systems have emerged to address the need for
> > > > scalable batch processing (Apache Hadoop) and stream processing (Storm,
> > > > Apache S4). In 2010 Google published a paper called “Dremel: Interactive
> > > > Analysis of Web-Scale Datasets,” describing a scalable system used
> > > > internally for interactive analysis of nested data. No open source
> > > > project has successfully replicated the capabilities of Dremel.
> > > >
> > > > Rationale
> > > > =========
> > > > There is a strong need in the market for low-latency interactive
> > > > analysis of large-scale datasets, including nested data (eg, JSON, Avro,
> > > > Protocol Buffers). This need was identified by Google and addressed
> > > > internally with a system called Dremel.
> > > >
> > > > In recent years open source systems have emerged to address the need for
> > > > scalable batch processing (Apache Hadoop) and stream processing (Storm,
> > > > Apache S4). Apache Hadoop, originally inspired by Google’s internal
> > > > MapReduce system, is used by thousands of organizations processing
> > > > large-scale datasets. Apache Hadoop is designed to achieve very high
> > > > throughput, but is not designed to achieve the sub-second latency needed
> > > > for interactive data analysis and exploration. Drill, inspired by
> > > > Google’s internal Dremel system, is intended to address this need.
> > > >
> > > > It is worth noting that, as explained by Google in the original paper,
> > > > Dremel complements MapReduce-based computing. Dremel is not intended as
> > > > a replacement for MapReduce and is often used in conjunction with it to
> > > > analyze outputs of MapReduce pipelines or rapidly prototype larger
> > > > computations. Indeed, Dremel and MapReduce are both used by thousands of
> > > > Google employees.
> > > >
> > > > Like Dremel, Drill supports a nested data model with data encoded in a
> > > > number of formats such as JSON, Avro or Protocol Buffers. In many
> > > > organizations nested data is the standard, so supporting a nested data
> > > > model eliminates the need to normalize the data. With that said, flat
> > > > data formats, such as CSV files, are naturally supported as a special
> > > > case of nested data.
> > > >
> > > > The Drill architecture consists of four key components/layers:
> > > > * Query languages: This layer is responsible for parsing the user’s
> > > > query and constructing an execution plan. The initial goal is to support
> > > > the SQL-like language used by Dremel and Google BigQuery
> > > > (https://developers.google.com/bigquery/docs/query-reference), which we
> > > > call DrQL. However, Drill is designed to support other languages and
> > > > programming models, such as the Mongo Query Language
> > > > (http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading
> > > > (http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
> > > > * Low-latency distributed execution engine: This layer is responsible
> > > > for executing the physical plan. It provides the scalability and fault
> > > > tolerance needed to efficiently query petabytes of data on 10,000
> > > > servers. Drill’s execution engine is based on research in distributed
> > > > execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> > > > columnar storage, and can be extended with additional operators and
> > > > connectors.
> > > > * Nested data formats: This layer is responsible for supporting various
> > > > data formats. The initial goal is to support the column-based format
> > > > used by Dremel. Drill is designed to support schema-based formats such
> > > > as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and
> > > > schema-less formats such as JSON, BSON or YAML. In addition, it is
> > > > designed to support column-based formats such as Dremel,
> > > > AVRO-806/Trevni and RCFile, and row-based formats such as Protocol
> > > > Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill
> > > > is that the execution engine is flexible enough to support column-based
> > > > processing as well as row-based processing. This is important because
> > > > column-based processing can be much more efficient when the data is
> > > > stored in a column-based format, but many large data assets are stored
> > > > in a row-based format that would require conversion before use.
> > > > * Scalable data sources: This layer is responsible for supporting
> > > > various data sources. The initial focus is to leverage Hadoop as a data
> > > > source.
> > > >
> > > > It is worth noting that no open source project has successfully
> > > > replicated the capabilities of Dremel, nor have any taken on the broader
> > > > goals of flexibility (eg, pluggable query languages, data formats, data
> > > > sources and execution engine operators/connectors) that are part of
> > > > Drill.
> > > >
> > > > Initial Goals
> > > > =============
> > > > The initial goals for this project are to specify the detailed
> > > > requirements and architecture, and then develop the initial
> > > > implementation including the execution engine and DrQL.
> > > > Like Apache Hadoop, which was built to support multiple storage systems
> > > > (through the FileSystem API) and file formats (through the
> > > > InputFormat/OutputFormat APIs), Drill will be built to support multiple
> > > > query languages, data formats and data sources. The initial
> > > > implementation of Drill will support DrQL and a column-based format
> > > > similar to Dremel’s.
> > > >
> > > > Current Status
> > > > ==============
> > > > Significant work has been completed to identify the initial requirements
> > > > and define the overall system architecture. The next step is to
> > > > implement the four components described in the Rationale section, and we
> > > > intend to do that development as an Apache project.
> > > >
> > > > Meritocracy
> > > > ===========
> > > > We plan to invest in supporting a meritocracy. We will discuss the
> > > > requirements in an open forum. Several companies have already expressed
> > > > interest in this project, and we intend to invite additional developers
> > > > to participate. We will encourage and monitor community participation so
> > > > that privileges can be extended to those who contribute. Also, Drill has
> > > > an extensible/pluggable architecture that encourages developers to
> > > > contribute various extensions, such as query languages, data formats,
> > > > data sources and execution engine operators and connectors. While some
> > > > companies will surely develop commercial extensions, we also anticipate
> > > > that some companies and individuals will want to contribute such
> > > > extensions back to the project, and we look forward to fostering a rich
> > > > ecosystem of extensions.
> > > >
> > > > Community
> > > > =========
> > > > The need for an open source system for interactive analysis of large
> > > > datasets is tremendous, so there is a potential for a very large
> > > > community. We believe that Drill’s extensible architecture will further
> > > > encourage community participation. Also, related Apache projects (eg,
> > > > Hadoop) have very large and active communities, and we expect that over
> > > > time Drill will also attract a large community.
> > > >
> > > > Core Developers
> > > > ===============
> > > > The developers on the initial committers list include experienced
> > > > distributed systems engineers:
> > > > * Tomer Shiran has experience developing distributed execution engines.
> > > > He developed Parallel DataSeries, a data-parallel version of the open
> > > > source DataSeries system (http://tesla.hpl.hp.com/opensource/). He is
> > > > also the author of Applying Idealized Lower-bound Runtime Models to
> > > > Understand Inefficiencies in Data-intensive Computing (SIGMETRICS 2011).
> > > > Tomer worked as a software developer and researcher at IBM Research,
> > > > Microsoft and HP Labs, and is now at MapR Technologies. He has been
> > > > active in the Hadoop community since 2009.
> > > > * Jason Frantz was at Clustrix, where he designed and developed the
> > > > first scale-out SQL database based on MySQL. Jason developed the
> > > > distributed query optimizer that powered Clustrix. He is now a software
> > > > engineer and architect at MapR Technologies.
> > > > * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
> > > > and has a history of over 30 years of contributions to open source. He
> > > > is now at MapR Technologies. Ted has been very active in the Hadoop
> > > > community since the project’s early days.
> > > > * MC Srivas is the co-founder and CTO of MapR Technologies. While at
> > > > Google he worked on Google’s scalable search infrastructure. MC Srivas
> > > > has been active in the Hadoop community since 2009.
> > > > * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
> > > > Concurrent, he developed Cascading, an Apache-licensed open source
> > > > application framework enabling Java developers to quickly and easily
> > > > develop robust Data Analytics and Data Management applications on Apache
> > > > Hadoop. Chris has been involved in the Hadoop community since the
> > > > project's early days.
> > > > * Keys Botzum was at IBM, where he worked on security and distributed
> > > > systems, and is currently at MapR Technologies.
> > > > * Gera Shegalov was at Oracle, where he worked on networking, storage
> > > > and database kernels, and is currently at MapR Technologies.
> > > > * Ryan Rawson is the VP of Engineering at Drawn to Scale, where he
> > > > developed Spire, a real-time operational database for Hadoop. He is also
> > > > a committer and PMC member for Apache HBase, and has a long history of
> > > > contributions to open source. Ryan has been involved in the Hadoop
> > > > community since the project's early days.
> > > >
> > > > We realize that additional employer diversity is needed, and we will
> > > > work aggressively to recruit developers from additional companies.
> > > >
> > > > Alignment
> > > > =========
> > > > The initial committers strongly believe that a system for interactive
> > > > analysis of large-scale datasets will gain broader adoption as an open
> > > > source, community-driven project, where the community can contribute not
> > > > only to the core components, but also to a growing collection of query
> > > > languages and optimizers, data formats, data sources, and execution
> > > > engine operators and connectors. Drill will integrate closely with
> > > > Apache Hadoop. First, the data will live in Hadoop. That is, Drill will
> > > > support Hadoop FileSystem implementations and HBase. Second,
> > > > Hadoop-related data formats will be supported (eg, Apache Avro, RCFile).
> > > > Third, MapReduce-based tools will be provided to produce column-based
> > > > formats. Fourth, Drill tables can be registered in HCatalog. Finally,
> > > > Hive is being considered as the basis of the DrQL implementation.
> > > >
> > > > Known Risks
> > > > ===========
> > > >
> > > > Orphaned Products
> > > > =================
> > > > The contributors are leading vendors in this space, with significant
> > > > open source experience, so the risk of being orphaned is relatively low.
> > > > The project could be at risk if vendors decided to change their
> > > > strategies in the market. In such an event, the current committers plan
> > > > to continue working on the project on their own time, though the
> > > > progress will likely be slower. We plan to mitigate this risk by
> > > > recruiting additional committers.
> > > >
> > > > Inexperience with Open Source
> > > > =============================
> > > > The initial committers include veteran Apache members (committers and
> > > > PMC members) and other developers who have varying degrees of experience
> > > > with open source projects. All have been involved with source code that
> > > > has been released under an open source license, and several also have
> > > > experience developing code with an open source development process.
> > > >
> > > > Homogeneous Developers
> > > > ======================
> > > > The initial committers are employed by a number of companies, including
> > > > MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> > > > recruiting additional committers from other companies.
> > > >
> > > > Reliance on Salaried Developers
> > > > ===============================
> > > > It is expected that Drill development will occur on both salaried time
> > > > and on volunteer time, after hours. The majority of initial committers
> > > > are paid by their employer to contribute to this project. However, they
> > > > are all passionate about the project, and we are confident that the
> > > > project will continue even if no salaried developers contribute to the
> > > > project. We are committed to recruiting additional committers including
> > > > non-salaried developers.
> > > >
> > > > Relationships with Other Apache Products
> > > > ========================================
> > > > As mentioned in the Alignment section, Drill is closely integrated with
> > > > Hadoop, Avro, Hive and HBase in numerous ways. For example, Drill data
> > > > lives inside a Hadoop environment (Drill operates on in situ data). We
> > > > look forward to collaborating with those communities, as well as other
> > > > Apache communities.
> > > >
> > > > An Excessive Fascination with the Apache Brand
> > > > ==============================================
> > > > Drill solves a real problem that many organizations struggle with, and
> > > > has been proven within Google to be of significant value. The
> > > > architecture is based on academic and industry research. Our rationale
> > > > for developing Drill as an Apache project is detailed in the Rationale
> > > > section. We believe that the Apache brand and community process will
> > > > help us attract more contributors to this project, and help establish
> > > > ubiquitous APIs. In addition, establishing consensus among users and
> > > > developers of a Dremel-like tool is a key requirement for success of the
> > > > project.
> > > >
> > > > Documentation
> > > > =============
> > > > Drill is inspired by Google’s Dremel. Google has published a paper
> > > > highlighting Dremel’s innovative nested column-based data format and
> > > > execution engine: http://research.google.com/pubs/pub36632.html
> > > >
> > > > High-level slides have been published by MapR: TODO
> > > >
> > > > Initial Source
> > > > ==============
> > > > There is no initial source code. All source code will be developed
> > > > within the Apache Incubator.
> > > >
> > > > Cryptography
> > > > ============
> > > > Drill will eventually support encryption on the wire. This is not one of
> > > > the initial goals, and we do not expect Drill to be a controlled export
> > > > item due to the use of encryption.
> > > >
> > > > Required Resources
> > > > ==================
> > > >
> > > > Mailing List
> > > > ============
> > > > * drill-private
> > > > * drill-dev
> > > > * drill-user
> > > >
> > > > Subversion Directory
> > > > ====================
> > > > Git is the preferred source control system: git://git.apache.org/drill
> > > >
> > > > Issue Tracking
> > > > ==============
> > > > JIRA Drill (DRILL)
> > > >
> > > > Initial Committers
> > > > ==================
> > > > * Tomer Shiran (tshiran at maprtech dot com)
> > > > * Ted Dunning (tdunning at apache dot org)
> > > > * Jason Frantz (jfrantz at maprtech dot com)
> > > > * MC Srivas (mcsrivas at maprtech dot com)
> > > > * Chris Wensel (chris at concurrentinc dot com)
> > > > * Keys Botzum (kbotzum at maprtech dot com)
> > > > * Gera Shegalov (gshegalov at maprtech dot com)
> > > > * Ryan Rawson (ryan at drawntoscale dot com)
> > > >
> > > > Affiliations
> > > > ============
> > > > The initial committers are employees of MapR Technologies, Drawn to
> > > > Scale and Concurrent. The nominated mentors are employees of MapR
> > > > Technologies, Lucid Imagination and Nokia.
> > > >
> > > > Sponsors
> > > > ========
> > > >
> > > > Champion
> > > > ========
> > > > Ted Dunning (tdunning at apache dot org)
> > > >
> > > > Nominated Mentors
> > > > =================
> > > > * Ted Dunning (tdunning at apache dot org) – Chief Application Architect
> > > > at MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> > > > * Grant Ingersoll (grant at lucidimagination dot com) – Chief Scientist
> > > > at Lucid Imagination, Committer for Lucene, Mahout and other projects.
> > > > * Isabel Drost (Isabel at apache dot org) – Software Developer at Nokia
> > > > Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> > > >
> > > > Sponsoring Entity
> > > > =================
> > > > Incubator
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >
>
>
>
> --
> Tomer Shiran
> Director of Product Management | MapR Technologies | 650-804-8657
>
