From: Jonathan Hsieh
Date: Fri, 27 May 2011 07:18:33 -0700
Subject: [PROPOSAL] Flume for the Apache Incubator
To: general@incubator.apache.org

Howdy!

I would like to propose Flume to be an Apache Incubator project. Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data to scalable data storage systems such as Apache Hadoop's HDFS.

Here's a link to the proposal in the Incubator wiki:
http://wiki.apache.org/incubator/FlumeProposal

I've also pasted the initial contents below.

Thanks!
Jon.

= Flume - A Distributed Log Collection System =

== Abstract ==

Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data to scalable data storage systems such as Apache Hadoop's HDFS.

== Proposal ==

Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Its main goal is to deliver data from applications to Hadoop's HDFS. It has a simple and flexible architecture for transporting streaming event data via Flume nodes to the data store. It is robust and fault-tolerant with tunable reliability mechanisms that rely upon many failover and recovery mechanisms. The system is centrally configured and allows for intelligent dynamic management. It uses a simple extensible data model that allows for lightweight online analytic applications. It provides a pluggable mechanism by which new sources, destinations, and analytic functions can be integrated within a Flume pipeline.

== Background ==

Flume was initially developed by Cloudera to enable reliable and simplified collection of log information from many distributed sources. It was later open-sourced by Cloudera on GitHub as an Apache 2.0 licensed project in June 2010. Since then, Flume has been formally released five times as versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1 (Oct 2010), 0.9.2 (Nov 2010), and 0.9.3 (Feb 2011).
These releases are also distributed by Cloudera as source and binaries, along with enhancements, as part of Cloudera's Distribution including Apache Hadoop (CDH).

== Rationale ==

Collecting log information in a data center in a timely, reliable, and efficient manner is difficult but important because, when aggregated and analyzed, log information can yield valuable business insights. We believe that users and operators need a manageable, systematic approach for log collection that simplifies the creation, the monitoring, and the administration of reliable log data pipelines. Oftentimes today, this collection is attempted by periodically shipping data in batches and by using potentially unreliable and inefficient ad-hoc methods.

Log data is typically generated in various systems running within a data center that can range from a few machines to hundreds of machines. In aggregate, the data acts like a large-volume continuous stream with contents that can have highly varied format and highly varied content. The volume and variety of raw log data make Apache Hadoop's HDFS file system an ideal storage location before the eventual analysis. Unfortunately, HDFS has limitations with regard to durability as well as scaling limitations when handling a large number of low-bandwidth connections or small files. Similar technical challenges arise when attempting to write data to other data storage services.

Flume addresses these challenges by providing a reliable, scalable, manageable, and extensible solution. It uses a streaming design for capturing and aggregating log information from varied sources in a distributed environment and has centralized management features for minimal configuration and management overhead.

== Initial Goals ==

Flume is currently in its first major release with a considerable number of enhancement requests, tasks, and issues recorded towards its future development.
The initial goal of this project will be to continue to build community in the spirit of the "Apache Way", and to address the highly requested features and bug fixes towards the next dot release.

Some goals include:
 * Standing up a sustaining Apache-based community around the Flume codebase.
 * Implementing core functionality of a usable highly-available Flume master.
 * Performance, usability, and robustness improvements.
 * Improving the ability to monitor and diagnose problems as data is transported.
 * Providing a centralized place for contributed connectors and related projects.

= Current Status =

== Meritocracy ==

Flume was initially developed by Jonathan Hsieh in July 2009 along with the development team at Cloudera. Developers external to Cloudera provided feedback, suggested features and fixes, and implemented extensions of Flume. The Cloudera engineering team has since maintained the project, with Jonathan Hsieh, Henry Robinson, and Patrick Hunt dedicated to its improvement. Contributors to Flume and its connectors include developers from different companies and different parts of the world.

== Community ==

Flume is currently used by a number of organizations all over the world. Flume has an active and growing user and developer community with active participation in the [user|https://groups.google.com/a/cloudera.org/group/flume-user/topics] and [developer|https://groups.google.com/a/cloudera.org/group/flume-dev/topics] mailing lists. The users and developers also communicate via IRC on #flume at irc.freenode.net.

Since the project was open-sourced, more than 15 people from diverse organizations have contributed code. During this period, the project team has hosted open, in-person, quarterly meetups to discuss new features, new designs, and new use-case stories.

== Core Developers ==

The core developers for the Flume project are:
 * Andrew Bayer: Andrew has a lot of expertise with build tools, specifically Jenkins continuous integration and Maven.
 * Jonathan Hsieh: Jonathan designed and implemented much of the original code.
 * Patrick Hunt: Patrick has improved the web interfaces of Flume components and contributed several build quality improvements.
 * Bruce Mitchener: Bruce has improved the internal logging infrastructure as well as edited significant portions of the Flume manual.
 * Henry Robinson: Henry has implemented much of the ZooKeeper integration, plugin mechanisms, as well as several Flume features and bug fixes.
 * Eric Sammer: Eric has implemented the Maven build, as well as several Flume features and bug fixes.

All core developers of the Flume project have contributed towards Hadoop or related Apache projects and are very familiar with Apache principles and philosophy for community-driven software development.

== Alignment ==

Flume complements Hadoop MapReduce, Pig, Hive, and HBase by providing a robust mechanism to allow log data integration from external systems for effective analysis. Its design enables efficient integration of newly ingested data into Hive's data warehouse.

Flume's architecture is open and easily extensible. This has encouraged many users to contribute plugins that integrate with other projects. For example, several users have contributed connectors to message queuing and bus services, to several open source data stores, to incremental search indexes, and to stream analysis engines.

= Known Risks =

== Orphaned Products ==

Flume is already deployed in production at multiple companies, and they are actively participating in feature requests and user-led discussions. Flume is getting traction with developers, and thus the risks of it being orphaned are minimal.

== Inexperience with Open Source ==

All code developed for Flume has been open-sourced by Cloudera under the Apache 2.0 license. All committers of the Flume project are intimately familiar with the Apache model for open-source development and are experienced with working with new contributors.

== Homogeneous Developers ==

The initial set of committers is from a small set of organizations. However, we expect that once approved for incubation, the project will attract new contributors from diverse organizations and will thus grow organically. The participation of developers from several different organizations in the mailing list is a strong indication of this.

== Reliance on Salaried Developers ==

It is expected that Flume will be developed on salaried and volunteer time, although all of the initial developers will work on it mainly on salaried time.

== Relationships with Other Apache Products ==

Flume depends upon other Apache projects: Apache Hadoop, Apache Log4J, Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple Apache Commons components. Its build depends upon Apache Ant and Apache Maven.

Flume users have created connectors that interact with several other Apache projects including Apache HBase and Apache Cassandra.

Flume's functionality has some indirect or direct overlap with the functionality of Apache Chukwa but has several significant architectural differences. Both systems can be used to collect log data to write to HDFS. However, Chukwa's primary goals are the analytic and monitoring aspects of a Hadoop cluster. Instead of focusing on analytics, Flume focuses primarily upon data transport and integration with a wide set of data sources and data destinations. Architecturally, Chukwa components are individually and statically configured. Chukwa also depends upon Hadoop MapReduce for its core functionality.
In contrast, Flume's components are dynamically and centrally configured and do not depend directly upon Hadoop MapReduce. Furthermore, Flume provides a more general model for handling data and enables integration with projects such as Apache Hive, data stores such as Apache HBase, Apache Cassandra and Voldemort, and several Apache Lucene-related projects.

== An Excessive Fascination with the Apache Brand ==

We would like Flume to become an Apache project to further foster a healthy community of contributors and consumers around the project. Since Flume directly interacts with many Apache Hadoop-related projects and solves an important problem for many Hadoop users, residing in the Apache Software Foundation will increase interaction with the larger community.

= Documentation =

 * All Flume documentation (User Guide, Developer Guide, Cookbook, and Windows Guide) is maintained within Flume sources and can be built directly.
 * Cloudera provides documentation specific to its distribution of Flume at: http://archive.cloudera.com/cdh/3/flume/
 * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
 * Flume jira at Cloudera: https://issues.cloudera.org/browse/flume

= Initial Source =

 * https://github.com/cloudera/flume/tree/

== Source and Intellectual Property Submission Plan ==

 * The initial source is already licensed under the Apache License, Version 2.0. https://github.com/cloudera/flume/blob/master/LICENSE

== External Dependencies ==

The required external dependencies all have Apache or compatible licenses. The components with non-Apache licenses are:
 * org.arabidopsis.ahocorasick: BSD-style

Non-Apache build tools that are used by Flume are as follows:
 * AsciiDoc: GNU GPLv2
 * FindBugs: GNU LGPL
 * Cobertura: GNU GPLv2
 * PMD: BSD-style

== Cryptography ==

Flume uses standard APIs and tools for SSH and SSL communication where necessary.

= Required Resources =

== Mailing lists ==

 * flume-private (with moderated subscriptions)
 * flume-dev
 * flume-commits
 * flume-user

== Subversion Directory ==

https://svn.apache.org/repos/asf/incubator/flume

== Issue Tracking ==

JIRA Flume (FLUME)

== Other Resources ==

The existing code already has unit and integration tests so we would like a Hudson instance to run them whenever a new patch is submitted. This can be added after project creation.

= Initial Committers =

 * Andrew Bayer (abayer at cloudera dot com)
 * Jonathan Hsieh (jon at cloudera dot com)
 * Aaron Kimball (akimball83 at gmail dot com)
 * Bruce Mitchener (bruce.mitchener at gmail dot com)
 * Arvind Prabhakar (arvind at cloudera dot com)
 * Ahmed Radwan (ahmed at cloudera dot com)
 * Henry Robinson (henry at cloudera dot com)
 * Eric Sammer (esammer at cloudera dot com)

= Affiliations =

 * Andrew Bayer, Cloudera
 * Jonathan Hsieh, Cloudera
 * Aaron Kimball, Odiago
 * Bruce Mitchener, Independent
 * Arvind Prabhakar, Cloudera
 * Ahmed Radwan, Cloudera
 * Henry Robinson, Cloudera
 * Eric Sammer, Cloudera

= Sponsors =

== Champion ==

 * Nigel Daley

== Nominated Mentors ==

 * Tom White
 * Nigel Daley

== Sponsoring Entity ==

 * Apache Incubator PMC

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com