From: "Mattmann, Chris A (388J)"
To: "general@incubator.apache.org"
Date: Sat, 28 May 2011 22:49:43 -0700
Subject: Re: [PROPOSAL] Sqoop Project

Great tool! Would be very happy for this to enter the Incubator...

Cheers,
Chris

On May 27, 2011, at 11:40 AM, arvind@cloudera.com wrote:

> Greetings All,
>
> We would like to propose Sqoop Project for inclusion in ASF Incubator as a
> new podling. Sqoop is a tool designed for efficiently transferring bulk data
> between Apache Hadoop and structured datastores such as relational
> databases. The complete proposal can be found at:
>
> http://wiki.apache.org/incubator/SqoopProposal
>
> The initial contents of this proposal are also pasted below for convenience.
>
> Thanks and Regards,
> Arvind Prabhakar
>
> = Sqoop - A Data Transfer Tool for Hadoop =
>
> == Abstract ==
>
> Sqoop is a tool designed for efficiently transferring bulk data between
> Apache Hadoop and structured datastores such as relational databases. You
> can use Sqoop to import data from external structured datastores into Hadoop
> Distributed File System or related systems like Hive and HBase. Conversely,
> Sqoop can be used to extract data from Hadoop and export it to external
> structured datastores such as relational databases and enterprise data
> warehouses.
>
> == Proposal ==
>
> Hadoop and related systems operate on large volumes of data. Typically this
> data originates from outside of Hadoop infrastructure and must be
> provisioned for consumption by Hadoop and related systems for analysis and
> processing.
> Sqoop allows fast provisioning of data into Hadoop and related
> systems by providing a bulk import and export mechanism that enables
> consumers to effectively use Hadoop for data analysis and processing.
>
> == Background ==
>
> Sqoop was initially developed by Cloudera to enable the import and export of
> data between various databases and Hadoop Distributed File System (HDFS). It
> was provided as a patch to Hadoop project via the issue [[
> https://issues.apache.org/jira/browse/HADOOP-5815|HADOOP-5815]] and was
> maintained as a contrib module to Hadoop from May 2009 to April 2010. In
> April 2010, Sqoop was removed from Hadoop contrib via [[
> https://issues.apache.org/jira/browse/MAPREDUCE-1644|MAPREDUCE-1644]] and
> was made available by Cloudera on [[http://github.com/cloudera/sqoop|GitHub]].
>
>
> Since then Sqoop has been maintained by Cloudera as an open source project
> on GitHub. All code available in Sqoop is open source and made publicly
> available under the Apache 2 license. During this time Sqoop has been
> formally released three times as versions 1.0, 1.1 and 1.2.
>
> == Rationale ==
>
> Hadoop is often used to process data that originated in or is later served by
> structured data stores such as relational databases, spreadsheets or
> enterprise data warehouses. Unfortunately, current methods of transferring
> data are inefficient and ad hoc, often consisting of manual steps specific
> to the external system. These steps are necessary to help provision this
> data for consumption by Map-Reduce jobs, or by systems that build on top of
> Hadoop such as Hive and Pig. The transfer of this data can take a substantial
> amount of time depending upon its size. An optimal transfer approach that
> works well with one particular datastore will typically not work as
> optimally with another datastore due to inherent architectural differences
> between different datastore implementations.
> Sqoop addresses this problem by
> providing connectivity of Hadoop with external systems via pluggable
> connectors. Specialized connectors are developed for optimal performance for
> data transfer between Hadoop and target systems.
>
> Analyzed and processed data from Hadoop and related systems may also need
> to be provisioned outside of Hadoop for consumption by business
> applications. Sqoop allows the export of data from Hadoop to external
> systems to facilitate its use in other systems. This too, like the import
> scenario, is implemented via specialized connectors that are built for the
> purposes of optimal integration between Hadoop and external systems.
>
> Connectors can be built for systems that Sqoop does not yet integrate with
> and thus can be easily incorporated into Sqoop. Connectors allow Sqoop to
> interface with external systems of different types, ensuring that newer
> systems can integrate with Hadoop with relative ease and in a consistent
> manner.
>
> Besides allowing integration with other external systems, Sqoop provides
> tight integration with systems that build on top of Hadoop such as Hive,
> HBase, etc., thus providing data integration between Hadoop based systems and
> external systems in a single step.
>
> == Initial Goals ==
>
> Sqoop is currently in its first major release with a considerable number of
> enhancement requests, tasks, and issues logged towards its future
> development. The initial goal of this project will be to address the highly
> requested features and bug-fixes towards its next dot release. The key
> features of interest are the following:
> * Support for bulk import into Apache HBase.
> * Allow user to supply password in permission protected file.
> * Support for pluggable query to help Sqoop identify the metadata
> associated with the source or target table definitions.
> * Allow user to specify custom split semantics for efficient
> parallelization of import jobs.
>
> = Current Status =
>
> == Meritocracy ==
>
> Sqoop has been an open source project since its start. It was initially
> developed by Aaron Kimball in May 2009 along with the development team at
> Cloudera and supplied as a patch to Hadoop project. Later it was moved to
> GitHub as a Cloudera open-source project where Cloudera engineering team has
> since maintained it with Arvind Prabhakar and Ahmed Radwan dedicated towards
> its improvement. Developers external to Cloudera have provided feedback,
> suggested features and fixes and implemented extensions of Sqoop since its
> inception. Contributors to Sqoop include developers from different
> companies and different parts of the world.
>
> == Community ==
>
> Sqoop is currently used by a number of organizations all over the world.
> Sqoop has an active and growing user community with active participation in
> [[https://groups.google.com/a/cloudera.org/group/sqoop-user/topics|user]]
> and [[
> https://groups.google.com/a/cloudera.org/group/sqoop-dev/topics|developer]]
> mailing lists.
>
> == Core Developers ==
>
> The core developers for Sqoop project are:
> * Aaron Kimball: Aaron designed and implemented much of the original code.
> * Arvind Prabhakar: Has been working on Sqoop features and bug fixes.
> * Ahmed Radwan: Has been working on Sqoop features and bug fixes.
> * Jonathan Hsieh: Has started working on Sqoop features and bug fixes.
> * Other contributors to the project include: Angus He, Brian Muller, Eli
> Collins, Guy Le Mar, James Grant, Konstantin Boudnik, Lars Francke, Michael
> Hausler, Michael Katzenellenbogen, Péter Happ and Scott Foster.
>
> All committers to Sqoop project have contributed towards Hadoop or related
> Apache projects and are very familiar with Apache principles and philosophy
> for community driven software development.
>
> == Alignment ==
>
> Sqoop complements Hadoop Map-Reduce, Pig, Hive, HBase by providing a robust
> mechanism to allow data integration from external systems for effective data
> analysis. It integrates with Hive and HBase currently and work is being done
> to integrate it with Pig.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Sqoop is already deployed in production at multiple companies and they are
> actively participating in feature requests and user led discussions. Sqoop
> is getting traction with developers and thus the risks of it being orphaned
> are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Sqoop has been open source from the start. The
> initial part of Sqoop development was done within Hadoop project as a
> contrib module. Since then it has been maintained as an Apache 2.0 licensed
> open-source project on GitHub by Cloudera.
>
> All committers of Sqoop project are intimately familiar with the Apache
> model for open-source development and are experienced with working with new
> contributors. Aaron Kimball, the creator of the project and one of the
> committers is also a committer on Apache MapReduce.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a small set of organizations. However,
> we expect that once approved for incubation, the project will attract new
> contributors from diverse organizations and will thus grow organically. The
> participation of developers from several different organizations in the
> mailing list is a strong indication for this assertion.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Sqoop will be developed on salaried and volunteer time,
> although all of the initial developers will work on it mainly on salaried
> time.
>
> == Relationships with Other Apache Products ==
>
> Sqoop depends upon other Apache Projects: Hadoop, Hive, HBase, Log4J and
> multiple Apache Commons components and build systems like Ant and Maven.
>
> == An Excessive Fascination with the Apache Brand ==
>
> The reasons for joining Apache are to increase the synergy with other Apache
> Hadoop related projects and to foster a healthy community of contributors
> and consumers around the project. This is facilitated by ASF and that is the
> primary reason we would like Sqoop to become an Apache project.
>
> = Documentation =
>
> * All Sqoop documentation is maintained within Sqoop sources and can be
> built directly.
> * Sqoop docs: http://archive.cloudera.com/cdh/3/sqoop/
> * Sqoop wiki at GitHub: https://github.com/cloudera/sqoop/wiki
> * Sqoop jira at Cloudera: https://issues.cloudera.org/browse/sqoop
>
> = Initial Source =
>
> * https://github.com/cloudera/sqoop/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
> * The initial source is already Apache 2.0 licensed.
>
> == External Dependencies ==
>
> The required external dependencies are all under the Apache License or
> compatible licenses. The components with non-Apache licenses are:
>
> * HSQLDB: HSQLDB License - a BSD-based license.
>
> Non-Apache build tools that are used by Sqoop are as follows:
>
> * AsciiDoc: GNU GPLv2
> * Checkstyle: GNU LGPLv3
> * FindBugs: GNU LGPL
> * Cobertura: GNU GPLv2
>
> == Cryptography ==
>
> Sqoop does not depend upon any cryptography tools or libraries.
>
> = Required Resources =
>
> == Mailing lists ==
>
> * sqoop-private (with moderated subscriptions)
> * sqoop-dev
> * sqoop-commits
> * sqoop-user
>
> == Subversion Directory ==
>
> https://svn.apache.org/repos/asf/incubator/sqoop
>
> == Issue Tracking ==
>
> JIRA Sqoop (SQOOP)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would like a
> Hudson instance to run them whenever a new patch is submitted. This can be
> added after project creation.
>
> = Initial Committers =
>
> * Arvind Prabhakar (arvind at cloudera dot com)
> * Ahmed Radwan (a dot aboelela at gmail dot com)
> * Jonathan Hsieh (jon at cloudera dot com)
> * Aaron Kimball (kimballa at apache dot org)
> * Greg Cottman (greg dot cottman at quest dot com)
> * Guy le Mar (guy dot lemar at quest dot com)
> * Roman Shaposhnik (rvs at cloudera dot com)
> * Andrew Bayer (andrew at cloudera dot com)
>
> A CLA is already on file for Aaron Kimball.
>
> = Affiliations =
>
> * Arvind Prabhakar, Cloudera
> * Ahmed Radwan, Cloudera
> * Jonathan Hsieh, Cloudera
> * Aaron Kimball, Odiago
> * Greg Cottman, Quest
> * Guy le Mar, Quest
> * Roman Shaposhnik, Cloudera
> * Andrew Bayer, Cloudera
>
> = Sponsors =
>
> == Champion ==
>
> * Tom White (tomwhite at apache dot org)
>
> == Nominated Mentors ==
>
> * Patrick Hunt (phunt at apache dot org)
>
> == Sponsoring Entity ==
>
> * Apache Incubator PMC

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++