Subject: Re: [PROPOSAL] Flume for the Apache Incubator
From: Ralph Goers
Date: Fri, 27 May 2011 18:07:56 -0700
Message-Id: <44C8C2A6-40B8-4A40-8AB9-DF4D275C5269@dslextreme.com>
To: general@incubator.apache.org

A hearty +1 from me. Do you need another mentor?

Ralph

On May 27, 2011, at 7:18 AM, Jonathan Hsieh wrote:

> Howdy!
>
> I would like to propose Flume to be an Apache Incubator project. Flume is a
> distributed, reliable, and available system for efficiently collecting,
> aggregating, and moving large amounts of log data to scalable data storage
> systems such as Apache Hadoop's HDFS.
>
> Here's a link to the proposal in the Incubator wiki:
> http://wiki.apache.org/incubator/FlumeProposal
>
> I've also pasted the initial contents below.
>
> Thanks!
> Jon.
>
> = Flume - A Distributed Log Collection System =
>
> == Abstract ==
>
> Flume is a distributed, reliable, and available system for efficiently
> collecting, aggregating, and moving large amounts of log data to scalable
> data storage systems such as Apache Hadoop's HDFS.
>
> == Proposal ==
>
> Flume is a distributed, reliable, and available system for efficiently
> collecting, aggregating, and moving large amounts of log data from many
> different sources to a centralized data store. Its main goal is to deliver
> data from applications to Hadoop's HDFS. It has a simple and flexible
> architecture for transporting streaming event data via Flume nodes to the
> data store. It is robust and fault-tolerant with tunable reliability
> mechanisms that rely upon many failover and recovery mechanisms. The system
> is centrally configured and allows for intelligent dynamic management. It
> uses a simple extensible data model that allows for lightweight online
> analytic applications.
> It provides a pluggable mechanism by which new sources, destinations, and
> analytic functions can be integrated within a Flume pipeline.
>
> == Background ==
>
> Flume was initially developed by Cloudera to enable reliable and simplified
> collection of log information from many distributed sources. It was later
> open-sourced by Cloudera on GitHub as an Apache 2.0 licensed project in June
> 2010. Since then, Flume has been formally released five times as
> versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1 (Oct 2010), 0.9.2 (Nov
> 2010), and 0.9.3 (Feb 2011). These releases are also distributed by
> Cloudera as source and binaries along with enhancements as part of the
> Cloudera Distribution including Apache Hadoop (CDH).
>
> == Rationale ==
>
> Collecting log information in a data center in a timely, reliable, and
> efficient manner is difficult but important because, when aggregated and
> analyzed, log information can yield valuable business insights. We believe
> that users and operators need a manageable, systematic approach for log
> collection that simplifies the creation, the monitoring, and the
> administration of reliable log data pipelines. Oftentimes today, this
> collection is attempted by periodically shipping data in batches and by
> using potentially unreliable and inefficient ad-hoc methods.
>
> Log data is typically generated in various systems running within a data
> center that can range from a few machines to hundreds of machines. In
> aggregate, the data acts like a large-volume continuous stream with contents
> that can have highly-varied format and highly-varied content. The volume
> and variety of raw log data make Apache Hadoop's HDFS file system an ideal
> storage location before the eventual analysis.
> Unfortunately, HDFS has limitations with regard to durability as well as
> scaling limitations when handling a large number of low-bandwidth
> connections or small files. Similar technical challenges arise when
> attempting to write data to other data storage services.
>
> Flume addresses these challenges by providing a reliable, scalable,
> manageable, and extensible solution. It uses a streaming design for
> capturing and aggregating log information from varied sources in a
> distributed environment and has centralized management features for minimal
> configuration and management overhead.
>
> == Initial Goals ==
>
> Flume is currently in its first major release with a considerable number of
> enhancement requests, tasks, and issues recorded towards its future
> development. The initial goal of this project will be to continue to build
> community in the spirit of the "Apache Way", and to address highly
> requested features and bug fixes for the next dot release.
>
> Some goals include:
> * Standing up a sustaining Apache-based community around the Flume codebase.
> * Implementing core functionality of a usable highly-available Flume master.
> * Performance, usability, and robustness improvements.
> * Improving the ability to monitor and diagnose problems as data is
>   transported.
> * Providing a centralized place for contributed connectors and related
>   projects.
>
> = Current Status =
>
> == Meritocracy ==
>
> Flume was initially developed by Jonathan Hsieh in July 2009 along with
> the development team at Cloudera. Developers external to Cloudera provided
> feedback, suggested features and fixes, and implemented extensions of Flume.
> The Cloudera engineering team has since maintained the project, with
> Jonathan Hsieh, Henry Robinson, and Patrick Hunt dedicated to its
> improvement.
> Contributors to Flume and its connectors include developers from different
> companies and different parts of the world.
>
> == Community ==
>
> Flume is currently used by a number of organizations all over the world.
> Flume has an active and growing user and developer community with active
> participation in
> [user|https://groups.google.com/a/cloudera.org/group/flume-user/topics] and
> [developer|https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
> mailing lists. The users and developers also communicate via IRC on #flume
> at irc.freenode.net.
>
> Since the project was open sourced, over 15 people from diverse
> organizations have contributed code. During this period, the project team
> has hosted open, in-person, quarterly meetups to discuss new features, new
> designs, and new use-case stories.
>
> == Core Developers ==
>
> The core developers for the Flume project are:
> * Andrew Bayer: Andrew has a lot of expertise with build tools,
>   specifically Jenkins continuous integration and Maven.
> * Jonathan Hsieh: Jonathan designed and implemented much of the original
>   code.
> * Patrick Hunt: Patrick has improved the web interfaces of Flume components
>   and contributed several build quality improvements.
> * Bruce Mitchener: Bruce has improved the internal logging infrastructure
>   as well as edited significant portions of the Flume manual.
> * Henry Robinson: Henry has implemented much of the ZooKeeper integration,
>   plugin mechanisms, as well as several Flume features and bug fixes.
> * Eric Sammer: Eric has implemented the Maven build, as well as several
>   Flume features and bug fixes.
>
> All core developers of the Flume project have contributed towards Hadoop or
> related Apache projects and are very familiar with Apache principles and
> philosophy for community-driven software development.
>
> == Alignment ==
>
> Flume complements Hadoop MapReduce, Pig, Hive, and HBase by providing a
> robust mechanism to allow log data integration from external systems for
> effective analysis. Its design enables efficient integration of newly
> ingested data into Hive's data warehouse.
>
> Flume's architecture is open and easily extensible. This has encouraged
> many users to contribute plugins that integrate with other projects. For
> example, several users have contributed connectors to message queuing and
> bus services, to several open source data stores, to incremental search
> indexes, and to stream analysis engines.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Flume is already deployed in production at multiple companies and they are
> actively participating in feature requests and user-led discussions. Flume
> is getting traction with developers and thus the risks of it being orphaned
> are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Flume has been open sourced by Cloudera under the
> Apache 2.0 license. All committers of the Flume project are intimately
> familiar with the Apache model for open-source development and are
> experienced with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The participation of developers from several different
> organizations in the mailing list is a strong indication of this.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Flume will be developed on salaried and volunteer time,
> although all of the initial developers will work on it mainly on salaried
> time.
>
> == Relationships with Other Apache Products ==
>
> Flume depends upon other Apache Projects: Apache Hadoop, Apache Log4J,
> Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple Apache Commons
> components. Its build depends upon Apache Ant and Apache Maven.
>
> Flume users have created connectors that interact with several other Apache
> projects including Apache HBase and Apache Cassandra.
>
> Flume's functionality has some indirect or direct overlap with the
> functionality of Apache Chukwa but has several significant architectural
> differences. Both systems can be used to collect log data to write to
> HDFS. However, Chukwa's primary goals are the analytic and monitoring
> aspects of a Hadoop cluster. Instead of focusing on analytics, Flume
> focuses primarily upon data transport and integration with a wide set of
> data sources and data destinations. Architecturally, Chukwa components are
> individually and statically configured. It also depends upon Hadoop
> MapReduce for its core functionality. In contrast, Flume's components are
> dynamically and centrally configured and do not depend directly upon
> Hadoop MapReduce. Furthermore, Flume provides a more general model for
> handling data and enables integration with projects such as Apache Hive,
> data stores such as Apache HBase, Apache Cassandra and Voldemort, and
> several Apache Lucene-related projects.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Flume to become an Apache project to further foster a healthy
> community of contributors and consumers around the project. Since Flume
> directly interacts with many Apache Hadoop-related projects and solves an
> important problem of many Hadoop users, residing in the Apache Software
> Foundation will increase interaction with the larger community.
>
> = Documentation =
>
> * All Flume documentation (User Guide, Developer Guide, Cookbook, and
>   Windows Guide) is maintained within Flume sources and can be built
>   directly.
> * Cloudera provides documentation specific to its distribution of Flume at:
>   http://archive.cloudera.com/cdh/3/flume/
> * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
> * Flume JIRA at Cloudera: https://issues.cloudera.org/browse/flume
>
> = Initial Source =
>
> * https://github.com/cloudera/flume/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License, Version
>   2.0. https://github.com/cloudera/flume/blob/master/LICENSE
>
> == External Dependencies ==
>
> The required external dependencies all use the Apache License or compatible
> licenses. The following components have non-Apache licenses:
>
> * org.arabidopsis.ahocorasick: BSD-style
>
> Non-Apache build tools that are used by Flume are as follows:
>
> * AsciiDoc: GNU GPLv2
> * FindBugs: GNU LGPL
> * Cobertura: GNU GPLv2
> * PMD: BSD-style
>
> == Cryptography ==
>
> Flume uses standard APIs and tools for SSH and SSL communication where
> necessary.
>
> = Required Resources =
>
> == Mailing lists ==
>
> * flume-private (with moderated subscriptions)
> * flume-dev
> * flume-commits
> * flume-user
>
> == Subversion Directory ==
>
> https://svn.apache.org/repos/asf/incubator/flume
>
> == Issue Tracking ==
>
> JIRA Flume (FLUME)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would like a
> Hudson instance to run them whenever a new patch is submitted. This can be
> added after project creation.
>
> = Initial Committers =
>
> * Andrew Bayer (abayer at cloudera dot com)
> * Jonathan Hsieh (jon at cloudera dot com)
> * Aaron Kimball (akimball83 at gmail dot com)
> * Bruce Mitchener (bruce.mitchener at gmail dot com)
> * Arvind Prabhakar (arvind at cloudera dot com)
> * Ahmed Radwan (ahmed at cloudera dot com)
> * Henry Robinson (henry at cloudera dot com)
> * Eric Sammer (esammer at cloudera dot com)
>
> = Affiliations =
>
> * Andrew Bayer, Cloudera
> * Jonathan Hsieh, Cloudera
> * Aaron Kimball, Odiago
> * Bruce Mitchener, Independent
> * Arvind Prabhakar, Cloudera
> * Ahmed Radwan, Cloudera
> * Henry Robinson, Cloudera
> * Eric Sammer, Cloudera
>
>
> = Sponsors =
>
> == Champion ==
>
> * Nigel Daley
>
> == Nominated Mentors ==
>
> * Tom White
> * Nigel Daley
>
> == Sponsoring Entity ==
>
> * Apache Incubator PMC
>
>
> -- 
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org