incubator-general mailing list archives

From "Mattmann, Chris A (388J)" <>
Subject Re: [PROPOSAL] Kafka for the Apache Incubator
Date Wed, 22 Jun 2011 16:49:58 GMT
Wow, looks neat!


On Jun 22, 2011, at 9:17 AM, Jun Rao wrote:

> Hi,
> I would like to propose Kafka to be an Apache Incubator project.  Kafka is a
> distributed, high throughput, publish-subscribe system for processing large
> amounts of streaming data.
> Here's a link to the proposal in the Incubator wiki
> I've also pasted the initial contents below.
> Thanks,
> Jun
> == Abstract ==
> Kafka is a distributed publish-subscribe system for processing large amounts
> of streaming data.
> == Proposal ==
> Kafka provides an extremely high throughput distributed publish/subscribe
> messaging system.  Additionally, it supports relatively long term
> persistence of messages to support a wide variety of consumers, partitioning
> of the message stream across servers and consumers, and functionality for
> loading data into Apache Hadoop for offline, batch processing.
> == Background ==
> Kafka was developed at LinkedIn to process the large amounts of events
> generated by that company's website and provide a common repository for many
> types of consumers to access and process those events. Kafka has been used
> in production at LinkedIn scale to handle dozens of types of events
> including page views, searches and social network activity. Kafka clusters
> at LinkedIn currently process more than two billion events per day.
> Kafka fills the gap between messaging systems such as Apache ActiveMQ, which
> can provide high-volume messaging but lack persistence of those
> messages, and log processing systems such as Scribe and Flume, which do not
> provide adequate latency for our diverse set of consumers.  Kafka can also
> be inserted into traditional log-processing systems, acting as an
> intermediate step before further processing. Kafka focuses relentlessly on
> performance and throughput by neither introspecting message content nor
> indexing messages on the broker.  We also achieve high performance by depending
> on Java's sendFile/transferTo capabilities to minimize intermediate buffer
> copies and relying on the OS's pagecache to efficiently serve up message
> contents to consumers.
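
The transferTo approach mentioned above can be illustrated with a minimal Java sketch. It is not taken from the Kafka source; the class name ZeroCopySketch, the method sendSegment and the temporary file are hypothetical names chosen for illustration. It assumes a log segment stored as a plain file and uses the standard FileChannel.transferTo call to hand a byte range to a target channel without copying it through an application-level buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration, not Kafka's actual broker code.
public class ZeroCopySketch {

    // Transfers up to `count` bytes of `logFile`, starting at `position`,
    // to `target`. Returns the number of bytes actually transferred.
    static long sendSegment(Path logFile, long position, long count,
                            WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(logFile, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = source.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file reached, or target accepted nothing
                }
                sent += n;
            }
            return sent;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a log segment on disk (assumed layout, for the demo only).
        Path log = Files.createTempFile("segment", ".log");
        Files.write(log, "m1|m2|m3".getBytes(StandardCharsets.UTF_8));

        // An in-memory sink keeps the demo self-contained. The OS-level
        // zero-copy path only applies when the target is a socket channel.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long sent = sendSegment(log, 3, 5, Channels.newChannel(sink));

        System.out.println(sent + " bytes: " + sink.toString("UTF-8"));
        Files.delete(log);
    }
}
```

In a real server the target would typically be a SocketChannel, which is where the operating system can move data from the page cache to the network without intermediate user-space copies. The in-memory channel here only demonstrates the call pattern.
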
> Kafka is written in Scala and depends on Apache ZooKeeper for coordination
> amongst its producers, brokers and consumers.
> Kafka was developed internally at LinkedIn to meet our particular use cases,
> but will be useful to many organizations facing a similar need to reliably
> process large amounts of streaming data.  Therefore, we would like to share
> it with the ASF and begin developing a community of developers and users within
> Apache.
> == Rationale ==
> Many organizations can benefit from a reliable stream processing system such
> as Kafka.  While our use case of processing events from a very large website
> like LinkedIn has driven the design of Kafka, its uses are varied and we
> expect many new use cases to emerge.  Kafka provides a natural bridge
> between near real-time event processing and offline batch processing and
> will appeal to many users.
> == Current Status ==
> === Meritocracy ===
> Our intent with this incubator proposal is to start building a diverse
> developer community around Kafka following the Apache meritocracy model.
> Since Kafka was open sourced we have solicited contributions via the website
> and presentations given to user groups and technical audiences.  We have had
> positive responses to these and have received several contributions and
> clients for other languages.  We plan to continue this support for new
> contributors and work with those who contribute significantly to the project
> to make them committers.
> === Community ===
> Kafka is currently being developed by engineers within LinkedIn and
> used in production in that company. Additionally, we have active users in or
> have received contributions from a diverse set of companies including
> MediaSift, SocialTwist, Clearspring and Urban Airship. Recent public
> presentations of Kafka and its goals garnered much interest from potential
> contributors. We hope to extend our contributor base significantly and
> invite all those who are interested in building high-throughput distributed
> systems to participate.  We have begun receiving contributions from outside
> of LinkedIn, including clients for several languages such as Ruby, PHP,
> Clojure, .NET and Python.
> To further this goal, we use GitHub issue tracking and branching facilities,
> as well as maintaining a public mailing list via Google Groups.
> === Core Developers ===
> Kafka is currently being developed by four engineers at LinkedIn: Neha
> Narkhede, Jun Rao, Jakob Homan and Jay Kreps. Jun has experience within
> Apache as a Cassandra committer and PMC member. Neha has been an active
> contributor to several projects LinkedIn has open sourced, including Bobo,
> Sensei and Zoie. Jay has experience with open source software as the
> originator of Project Voldemort, as well as being active within
> the Hadoop ecosystem community. Jakob is an Apache Hadoop committer and PMC
> member and a previous Apache ZooKeeper contributor.
> === Alignment ===
> The ASF is the natural choice to host the Kafka project as its goal of
> encouraging community-driven open-source projects fits with our vision for
> Kafka.  Additionally, many other projects with which we are familiar
> and with which we expect Kafka to integrate, such as Apache Hadoop, Pig,
> ZooKeeper and log4j, are hosted by the ASF, and we will benefit and provide
> benefit by close proximity to them.
> == Known Risks ==
> === Orphaned Products ===
> The core developers plan to work full time on the project. There is very
> little risk of Kafka being abandoned as it is a critical part of LinkedIn's
> internal infrastructure and is in production use.
> === Inexperience with Open Source ===
> All of the core developers have experience with open source development.
> LinkedIn open sourced Kafka several months ago and has been receiving
> contributions since.  Jun is an Apache Cassandra committer and PMC member.
> Jay and Neha have been involved with several open source projects released
> by LinkedIn.  Jakob has been actively involved with the ASF as a full-time
> Hadoop committer and PMC member.
> === Homogeneous Developers ===
> The current core developers are all from LinkedIn. However, we hope to
> establish a developer community that includes contributors from several
> corporations and we are actively encouraging new contributors via the mailing
> lists and public presentations of Kafka.
> === Reliance on Salaried Developers ===
> Currently, the developers are paid to work on Kafka. Once the
> project has a community built around it, we expect to gain committers,
> developers and community members from outside the current core developers.
> However, because LinkedIn relies on Kafka internally, the reliance on
> salaried developers is unlikely to change.
> === Relationships with Other Apache Products ===
> Kafka is deeply integrated with Apache products. Kafka uses Apache ZooKeeper
> to coordinate its state amongst the brokers, consumers, and soon, the
> producers.  Kafka provides input formats to allow Hadoop MapReduce to load
> data directly from Kafka.  Kafka provides an appender to allow consuming
> data directly from Apache log4j.
> === An Excessive Fascination with the Apache Brand ===
> While we respect the reputation of the Apache brand and have no doubts that
> it will attract contributors and users, our interest is primarily to give
> Kafka a solid home as an open source project following an established
> development model. We have also given reasons in the Rationale and Alignment
> sections.
> == Documentation ==
> Information about Kafka can be found at [] The
> following links provide more information about the project:
> * Kafka roadmap and goals: []
> * The GitHub site: []
> * Kafka overview from Jay Kreps: []
> * Kafka overview from Jakob Homan: []
> * Kafka paper at NetDB 2011: []
> == Initial Source ==
> Kafka has been under development at LinkedIn since November 2009.  It was
> open sourced by LinkedIn in January 2011.  It is currently hosted on GitHub
> under the Apache license at []
> Kafka is mainly written in Scala with some performance testing code in Java.
> Several clients have been contributed in other languages, including Ruby,
> PHP, Clojure, .NET and Python.  Its source tree is entirely self-contained
> and relies on simple build tool (sbt) as its build system and dependency
> resolution mechanism.
> == External Dependencies ==
> The dependencies all have Apache compatible licenses.
> == Cryptography ==
> Not applicable.
> == Required Resources ==
> === Mailing Lists ===
> * kafka-private for private PMC discussions (with moderated subscriptions)
> * kafka-dev
> * kafka-commits
> * kafka-user
> === Subversion Directory ===
> []
> === Issue Tracking ===
> JIRA Kafka (KAFKA)
> === Other Resources ===
> The existing code already has unit tests, so we would like a Hudson instance
> to run them whenever a new patch is submitted. This can be added after
> project creation.
> == Initial Committers ==
> * Jay Kreps
> * Jun Rao
> * Neha Narkhede
> * Jakob Homan
> == Affiliations ==
> * Jay Kreps (LinkedIn)
> * Jun Rao (LinkedIn)
> * Neha Narkhede (LinkedIn)
> * Jakob Homan (LinkedIn)
> == Sponsors ==
> === Champion ===
> Chris Douglas (Apache Member)
> === Nominated Mentors ===
> * Alan Cabrera (Apache Member)
> * Geir Magnusson, Jr. (Apache Member and Director)
> * Owen O'Malley (Apache Member)
> === Sponsoring Entity ===
> We are requesting the Incubator to sponsor this project.

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
