Mailing-List: contact general-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@incubator.apache.org
Received-SPF: pass (nike.apache.org: domain of paliwalashish@gmail.com
 designates 209.85.220.175 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <BANLkTikpFomGb_5NG7=0yOJ7pU025a2gig@mail.gmail.com>
References: <BANLkTikpFomGb_5NG7=0yOJ7pU025a2gig@mail.gmail.com>
Date: Sat, 28 May 2011 10:54:35 +0530
Message-ID: <BANLkTi=Ju48SMR44KZkqpLmYQ8h5nT1dRA@mail.gmail.com>
Subject: Re: [PROPOSAL] Flume for the Apache Incubator
From: Ashish <paliwalashish@gmail.com>
To: general@incubator.apache.org
Content-Type: multipart/alternative; boundary=bcaec547c9d313d2b904a44f45f0

--bcaec547c9d313d2b904a44f45f0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

+1

On Fri, May 27, 2011 at 7:48 PM, Jonathan Hsieh <jon@cloudera.com> wrote:

> Howdy!
>
> I would like to propose Flume to be an Apache Incubator project.  Flume is a
> distributed, reliable, and available system for efficiently collecting,
> aggregating, and moving large amounts of log data to scalable data storage
> systems such as Apache Hadoop's HDFS.
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/FlumeProposal
>
> I've also pasted the initial contents below.
>
> Thanks!
> Jon.
>
> = Flume - A Distributed Log Collection System =
>
> == Abstract ==
>
> Flume is a distributed, reliable, and available system for efficiently
> collecting, aggregating, and moving large amounts of log data to scalable
> data storage systems such as Apache Hadoop's HDFS.
>
> == Proposal ==
>
> Flume is a distributed, reliable, and available system for efficiently
> collecting, aggregating, and moving large amounts of log data from many
> different sources to a centralized data store. Its main goal is to deliver
> data from applications to Hadoop's HDFS.  It has a simple and flexible
> architecture for transporting streaming event data via Flume nodes to the
> data store.  It is robust and fault-tolerant with tunable reliability
> mechanisms and many failover and recovery mechanisms. The system
> is centrally configured and allows for intelligent dynamic management. It
> uses a simple extensible data model that allows for lightweight online
> analytic applications.  It provides a pluggable mechanism by which new
> sources, destinations, and analytic functions can be integrated within
> a Flume pipeline.
>
> == Background ==
>
> Flume was initially developed by Cloudera to enable reliable and simplified
> collection of log information from many distributed sources. It was later
> open-sourced by Cloudera on GitHub as an Apache 2.0 licensed project in
> June 2010. Since then, Flume has been formally released five times as
> versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1 (Oct 2010), 0.9.2
> (Nov 2010), and 0.9.3 (Feb 2011).  These releases are also distributed by
> Cloudera as source and binaries along with enhancements as part of Cloudera
> Distribution including Apache Hadoop (CDH).
>
> == Rationale ==
>
> Collecting log information in a data center in a timely, reliable, and
> efficient manner is a difficult but important challenge because, when
> aggregated and analyzed, log information can yield valuable business
> insights.  We believe that users and operators need a manageable,
> systematic approach for log collection that simplifies the creation,
> monitoring, and administration of reliable log data pipelines.  Today,
> this collection is often attempted by periodically shipping data in
> batches and by using potentially unreliable and inefficient ad-hoc methods.
>
> Log data is typically generated in various systems running within a data
> center that can range from a few machines to hundreds of machines.  In
> aggregate, the data acts like a large-volume continuous stream with
> contents that can have highly varied format and highly varied content.
> The volume and variety of raw log data make Apache Hadoop's HDFS file
> system an ideal storage location before the eventual analysis.
> Unfortunately, HDFS has limitations with regard to durability as well as
> scaling limitations when handling a large number of low-bandwidth
> connections or small files.  Similar technical challenges also arise when
> attempting to write data to other data storage services.
>
> Flume addresses these challenges by providing a reliable, scalable,
> manageable, and extensible solution.  It uses a streaming design for
> capturing and aggregating log information from varied sources in a
> distributed environment and has centralized management features for minimal
> configuration and management overhead.
>
> == Initial Goals ==
>
> Flume is currently in its first major release with a considerable number of
> enhancement requests, tasks, and issues recorded towards its future
> development. The initial goal of this project will be to continue to build
> community in the spirit of the "Apache Way", and to address the highly
> requested features and bug-fixes towards the next dot release.
>
> Some goals include:
> * Standing up a sustaining Apache-based community around the Flume
> codebase.
> * Implementing core functionality of a usable highly-available Flume
> master.
> * Performance, usability, and robustness improvements.
> * Improving the ability to monitor and diagnose problems as data is
> transported.
> * Providing a centralized place for contributed connectors and related
> projects.
>
> = Current Status =
>
> == Meritocracy ==
>
> Flume was initially developed by Jonathan Hsieh in July 2009 along with the
> development team at Cloudera. Developers external to Cloudera provided
> feedback, suggested features and fixes, and implemented extensions of Flume.
> The Cloudera engineering team has since maintained the project, with Jonathan
> Hsieh, Henry Robinson, and Patrick Hunt dedicated to its improvement.
> Contributors to Flume and its connectors include developers from different
> companies and different parts of the world.
>
> == Community ==
>
> Flume is currently used by a number of organizations all over the world.
> Flume has an active and growing user and developer community with active
> participation in
> [user|https://groups.google.com/a/cloudera.org/group/flume-user/topics] and
> [developer|https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
> mailing lists.  The users and developers also communicate via IRC on #flume
> at irc.freenode.net.
>
> Since open sourcing the project, there have been over 15 different people
> from diverse organizations who have contributed code. During this period,
> the project team has hosted open, in-person, quarterly meetups to discuss
> new features, new designs, and new use-case stories.
>
> == Core Developers ==
>
> The core developers for the Flume project are:
>  * Andrew Bayer: Andrew has a lot of expertise with build tools,
> specifically Jenkins continuous integration and Maven.
>  * Jonathan Hsieh: Jonathan designed and implemented much of the original
> code.
>  * Patrick Hunt: Patrick has improved the web interfaces of Flume
> components and contributed several build quality improvements.
>  * Bruce Mitchener: Bruce has improved the internal logging infrastructure
> as well as edited significant portions of the Flume manual.
>  * Henry Robinson: Henry has implemented much of the ZooKeeper integration,
> plugin mechanisms, as well as several Flume features and bug fixes.
>  * Eric Sammer: Eric has implemented the Maven build, as well as several
> Flume features and bug fixes.
>
> All core developers of the Flume project have contributed towards Hadoop or
> related Apache projects and are very familiar with Apache principles and
> philosophy for community-driven software development.
>
> == Alignment ==
>
> Flume complements Hadoop MapReduce, Pig, Hive, and HBase by providing a
> robust mechanism to allow log data integration from external systems for
> effective analysis.  Its design enables efficient integration of newly
> ingested data into Hive's data warehouse.
>
> Flume's architecture is open and easily extensible.  This has encouraged
> many users to contribute plugins that integrate with other projects.  For
> example, several users have contributed connectors to message queuing and
> bus services, to several open source data stores, to incremental search
> indexes, and to stream analysis engines.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Flume is already deployed in production at multiple companies and they are
> actively participating in feature requests and user-led discussions. Flume
> is getting traction with developers and thus the risks of it being orphaned
> are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Flume has been open sourced by Cloudera under the
> Apache 2.0 license.  All committers of the Flume project are intimately
> familiar with the Apache model for open-source development and are
> experienced with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The participation of developers from several different
> organizations in the mailing list is a strong indication for this
> assertion.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Flume will be developed on salaried and volunteer time,
> although all of the initial developers will work on it mainly on salaried
> time.
>
> == Relationships with Other Apache Products ==
>
> Flume depends upon other Apache Projects: Apache Hadoop, Apache Log4J,
> Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple Apache Commons
> components. Its build depends upon Apache Ant and Apache Maven.
>
> Flume users have created connectors that interact with several other Apache
> projects including Apache HBase and Apache Cassandra.
>
> Flume's functionality has some indirect or direct overlap with the
> functionality of Apache Chukwa but has several significant architectural
> differences.  Both systems can be used to collect log data to write to
> HDFS.  However, Chukwa's primary goals are the analytic and monitoring
> aspects of a Hadoop cluster.  Instead of focusing on analytics, Flume
> focuses primarily upon data transport and integration with a wide set of
> data sources and data destinations.  Architecturally, Chukwa components
> are individually and statically configured.  Chukwa also depends upon
> Hadoop MapReduce for its core functionality.  In contrast, Flume's
> components are dynamically and centrally configured and do not depend
> directly upon Hadoop MapReduce.  Furthermore, Flume provides a more general
> model for handling data and enables integration with projects such as
> Apache Hive, data stores such as Apache HBase, Apache Cassandra and
> Voldemort, and several Apache Lucene-related projects.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Flume to become an Apache project to further foster a healthy
> community of contributors and consumers around the project.  Since Flume
> directly interacts with many Apache Hadoop-related projects and solves an
> important problem of many Hadoop users, residing in the Apache Software
> Foundation will increase interaction with the larger community.
>
> = Documentation =
>
>  * All Flume documentation (User Guide, Developer Guide, Cookbook, and
> Windows Guide) is maintained within Flume sources and can be built
> directly.
>  * Cloudera provides documentation specific to its distribution of Flume
> at:
> http://archive.cloudera.com/cdh/3/flume/
>  * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
>  * Flume jira at Cloudera: https://issues.cloudera.org/browse/flume
>
> = Initial Source =
>
>  * https://github.com/cloudera/flume/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The initial source is already licensed under the Apache License, Version
> 2.0. https://github.com/cloudera/flume/blob/master/LICENSE
>
> == External Dependencies ==
>
> The required external dependencies all use the Apache License or compatible
> licenses. The components with non-Apache licenses are enumerated below:
>
>  * org.arabidopsis.ahocorasick : BSD-style
>
> Non-Apache build tools that are used by Flume are as follows:
>
>  * AsciiDoc: GNU GPLv2
>  * FindBugs: GNU LGPL
>  * Cobertura: GNU GPLv2
>  * PMD : BSD-style
>
> == Cryptography ==
>
> Flume uses standard APIs and tools for SSH and SSL communication where
> necessary.
>
> = Required Resources =
>
> == Mailing lists ==
>
>  * flume-private (with moderated subscriptions)
>  * flume-dev
>  * flume-commits
>  * flume-user
>
> == Subversion Directory ==
>
> https://svn.apache.org/repos/asf/incubator/flume
>
> == Issue Tracking ==
>
> JIRA Flume (FLUME)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would like a
> Hudson instance to run them whenever a new patch is submitted. This can be
> added after project creation.
>
> = Initial Committers =
>
>  * Andrew Bayer (abayer at cloudera dot com)
>  * Jonathan Hsieh (jon at cloudera dot com)
>  * Aaron Kimball (akimball83 at gmail dot com)
>  * Bruce Mitchener (bruce.mitchener at gmail dot com)
>  * Arvind Prabhakar (arvind at cloudera dot com)
>  * Ahmed Radwan (ahmed at cloudera dot com)
>  * Henry Robinson (henry at cloudera dot com)
>  * Eric Sammer (esammer at cloudera dot com)
>
> = Affiliations =
>
>  * Andrew Bayer, Cloudera
>  * Jonathan Hsieh, Cloudera
>  * Aaron Kimball, Odiago
>  * Bruce Mitchener, Independent
>  * Arvind Prabhakar, Cloudera
>  * Ahmed Radwan, Cloudera
>  * Henry Robinson, Cloudera
>  * Eric Sammer, Cloudera
>
>
> = Sponsors =
>
> == Champion ==
>
>  * Nigel Daley
>
> == Nominated Mentors ==
>
>  * Tom White
>  * Nigel Daley
>
> == Sponsoring Entity ==
>
>  * Apache Incubator PMC
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

--bcaec547c9d313d2b904a44f45f0--