Date: Fri, 27 May 2011 16:45:55 -0700
Subject: Re: [PROPOSAL] Sqoop Project
From: Eric Sammer
To: general@incubator.apache.org

+1 to sqoop entering the incubator.

On Fri, May 27, 2011 at 11:40 AM, arvind@cloudera.com wrote:

> Greetings All,
>
> We would like to propose Sqoop Project for inclusion in ASF Incubator as a
> new podling. Sqoop is a tool designed for efficiently transferring bulk
> data between Apache Hadoop and structured datastores such as relational
> databases. The complete proposal can be found at:
>
> http://wiki.apache.org/incubator/SqoopProposal
>
> The initial contents of this proposal are also pasted below for
> convenience.
>
> Thanks and Regards,
> Arvind Prabhakar
>
> = Sqoop - A Data Transfer Tool for Hadoop =
>
> == Abstract ==
>
> Sqoop is a tool designed for efficiently transferring bulk data between
> Apache Hadoop and structured datastores such as relational databases. You
> can use Sqoop to import data from external structured datastores into the
> Hadoop Distributed File System or related systems like Hive and HBase.
> Conversely, Sqoop can be used to extract data from Hadoop and export it to
> external structured datastores such as relational databases and enterprise
> data warehouses.
>
> == Proposal ==
>
> Hadoop and related systems operate on large volumes of data. Typically this
> data originates from outside of Hadoop infrastructure and must be
> provisioned for consumption by Hadoop and related systems for analysis and
> processing. Sqoop allows fast provisioning of data into Hadoop and related
> systems by providing a bulk import and export mechanism that enables
> consumers to effectively use Hadoop for data analysis and processing.
>
> == Background ==
>
> Sqoop was initially developed by Cloudera to enable the import and export
> of data between various databases and the Hadoop Distributed File System
> (HDFS). It was provided as a patch to the Hadoop project via the issue
> [[https://issues.apache.org/jira/browse/HADOOP-5815|HADOOP-5815]] and was
> maintained as a contrib module to Hadoop from May 2009 to April 2010.
> In April 2010, Sqoop was removed from Hadoop contrib via
> [[https://issues.apache.org/jira/browse/MAPREDUCE-1644|MAPREDUCE-1644]] and
> was made available by Cloudera on
> [[http://github.com/cloudera/sqoop|GitHub]].
>
> Since then Sqoop has been maintained by Cloudera as an open source project
> on GitHub. All code available in Sqoop is open source and made publicly
> available under the Apache 2 license. During this time Sqoop has been
> formally released three times as versions 1.0, 1.1 and 1.2.
>
> == Rationale ==
>
> Hadoop is often used to process data that originated in, or is later served
> by, structured data stores such as relational databases, spreadsheets or
> enterprise data warehouses. Unfortunately, current methods of transferring
> data are inefficient and ad hoc, often consisting of manual steps specific
> to the external system. These steps are necessary to help provision this
> data for consumption by Map-Reduce jobs, or by systems that build on top of
> Hadoop such as Hive and Pig. The transfer of this data can take a
> substantial amount of time depending upon its size. A transfer approach
> that works well with one particular datastore will typically not work as
> well with another datastore due to inherent architectural differences
> between different datastore implementations. Sqoop addresses this problem
> by providing connectivity of Hadoop with external systems via pluggable
> connectors. Specialized connectors are developed for optimal performance
> for data transfer between Hadoop and target systems.
>
> Analyzed and processed data from Hadoop and related systems may also need
> to be provisioned outside of Hadoop for consumption by business
> applications. Sqoop allows the export of data from Hadoop to external
> systems to facilitate its use in other systems.
> This too, like the import scenario, is implemented via specialized
> connectors that are built for the purposes of optimal integration between
> Hadoop and external systems.
>
> Connectors can be built for systems that Sqoop does not yet integrate with
> and thus can be easily incorporated into Sqoop. Connectors allow Sqoop to
> interface with external systems of different types, ensuring that newer
> systems can integrate with Hadoop with relative ease and in a consistent
> manner.
>
> Besides allowing integration with other external systems, Sqoop provides
> tight integration with systems that build on top of Hadoop such as Hive and
> HBase, thus providing data integration between Hadoop-based systems and
> external systems in a single step.
>
> == Initial Goals ==
>
> Sqoop is currently in its first major release with a considerable number of
> enhancement requests, tasks, and issues logged towards its future
> development. The initial goal of this project will be to address the highly
> requested features and bug fixes for its next dot release. The key
> features of interest are the following:
>  * Support for bulk import into Apache HBase.
>  * Allow user to supply password in permission protected file.
>  * Support for pluggable query to help Sqoop identify the metadata
>    associated with the source or target table definitions.
>  * Allow user to specify custom split semantics for efficient
>    parallelization of import jobs.
>
> = Current Status =
>
> == Meritocracy ==
>
> Sqoop has been an open source project since its start. It was initially
> developed by Aaron Kimball in May 2009 along with the development team at
> Cloudera and supplied as a patch to the Hadoop project. Later it was moved
> to GitHub as a Cloudera open-source project, where the Cloudera engineering
> team has since maintained it with Arvind Prabhakar and Ahmed Radwan
> dedicated towards its improvement.
> Developers external to Cloudera have provided feedback, suggested features
> and fixes, and implemented extensions of Sqoop since its inception.
> Contributors to Sqoop include developers from different companies and
> different parts of the world.
>
> == Community ==
>
> Sqoop is currently used by a number of organizations all over the world.
> Sqoop has an active and growing user community with active participation in
> [[https://groups.google.com/a/cloudera.org/group/sqoop-user/topics|user]]
> and
> [[https://groups.google.com/a/cloudera.org/group/sqoop-dev/topics|developer]]
> mailing lists.
>
> == Core Developers ==
>
> The core developers for Sqoop project are:
>  * Aaron Kimball: Aaron designed and implemented much of the original code.
>  * Arvind Prabhakar: Has been working on Sqoop features and bug fixes.
>  * Ahmed Radwan: Has been working on Sqoop features and bug fixes.
>  * Jonathan Hsieh: Has started working on Sqoop features and bug fixes.
>  * Other contributors to the project include: Angus He, Brian Muller, Eli
>    Collins, Guy Le Mar, James Grant, Konstantin Boudnik, Lars Francke,
>    Michael Hausler, Michael Katzenellenbogen, Péter Happ and Scott Foster.
>
> All committers to Sqoop project have contributed towards Hadoop or related
> Apache projects and are very familiar with Apache principles and philosophy
> for community driven software development.
>
> == Alignment ==
>
> Sqoop complements Hadoop Map-Reduce, Pig, Hive and HBase by providing a
> robust mechanism to allow data integration from external systems for
> effective data analysis. It integrates with Hive and HBase currently and
> work is being done to integrate it with Pig.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Sqoop is already deployed in production at multiple companies and they are
> actively participating in feature requests and user led discussions. Sqoop
> is getting traction with developers and thus the risks of it being orphaned
> are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Sqoop has been open source from the start. The
> initial part of Sqoop development was done within Hadoop project as a
> contrib module. Since then it has been maintained as an Apache 2.0 licensed
> open-source project on GitHub by Cloudera.
>
> All committers of Sqoop project are intimately familiar with the Apache
> model for open-source development and are experienced with working with new
> contributors. Aaron Kimball, the creator of the project and one of the
> committers, is also a committer on Apache MapReduce.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a small set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The participation of developers from several different
> organizations in the mailing list is a strong indication of this.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Sqoop will be developed on salaried and volunteer time,
> although all of the initial developers will work on it mainly on salaried
> time.
>
> == Relationships with Other Apache Products ==
>
> Sqoop depends upon other Apache Projects: Hadoop, Hive, HBase, Log4J and
> multiple Apache commons components and build systems like Ant and Maven.
>
> == An Excessive Fascination with the Apache Brand ==
>
> The reasons for joining Apache are to increase the synergy with other
> Apache Hadoop related projects and to foster a healthy community of
> contributors and consumers around the project. This is facilitated by ASF
> and that is the primary reason we would like Sqoop to become an Apache
> project.
>
> = Documentation =
>
>  * All Sqoop documentation is maintained within Sqoop sources and can be
>    built directly.
>  * Sqoop docs: http://archive.cloudera.com/cdh/3/sqoop/
>  * Sqoop wiki at GitHub: https://github.com/cloudera/sqoop/wiki
>  * Sqoop jira at Cloudera: https://issues.cloudera.org/browse/sqoop
>
> = Initial Source =
>
>  * https://github.com/cloudera/sqoop/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already Apache 2.0 licensed.
>
> == External Dependencies ==
>
> The required external dependencies are all under the Apache License or
> compatible licenses. The following components have non-Apache licenses:
>
>  * HSQLDB: HSQLDB License - a BSD-based license.
>
> Non-Apache build tools that are used by Sqoop are as follows:
>
>  * AsciiDoc: GNU GPLv2
>  * Checkstyle: GNU LGPLv3
>  * FindBugs: GNU LGPL
>  * Cobertura: GNU GPLv2
>
> == Cryptography ==
>
> Sqoop does not depend upon any cryptography tools or libraries.
>
> = Required Resources =
>
> == Mailing lists ==
>
>  * sqoop-private (with moderated subscriptions)
>  * sqoop-dev
>  * sqoop-commits
>  * sqoop-user
>
> == Subversion Directory ==
>
> https://svn.apache.org/repos/asf/incubator/sqoop
>
> == Issue Tracking ==
>
> JIRA Sqoop (SQOOP)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would like a
> Hudson instance to run them whenever a new patch is submitted. This can be
> added after project creation.
>
> = Initial Committers =
>
>  * Arvind Prabhakar (arvind at cloudera dot com)
>  * Ahmed Radwan (a dot aboelela at gmail dot com)
>  * Jonathan Hsieh (jon at cloudera dot com)
>  * Aaron Kimball (kimballa at apache dot org)
>  * Greg Cottman (greg dot cottman at quest dot com)
>  * Guy le Mar (guy dot lemar at quest dot com)
>  * Roman Shaposhnik (rvs at cloudera dot com)
>  * Andrew Bayer (andrew at cloudera dot com)
>
> A CLA is already on file for Aaron Kimball.
>
> = Affiliations =
>
>  * Arvind Prabhakar, Cloudera
>  * Ahmed Radwan, Cloudera
>  * Jonathan Hsieh, Cloudera
>  * Aaron Kimball, Odiago
>  * Greg Cottman, Quest
>  * Guy le Mar, Quest
>  * Roman Shaposhnik, Cloudera
>  * Andrew Bayer, Cloudera
>
> = Sponsors =
>
> == Champion ==
>
>  * Tom White (tomwhite at apache dot org)
>
> == Nominated Mentors ==
>
>  * Patrick Hunt (phunt at apache dot org)
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>

--
Eric Sammer
twitter: esammer
data: www.cloudera.com