incubator-general mailing list archives

From Jun Rao <jun...@gmail.com>
Subject [PROPOSAL] Kafka for the Apache Incubator
Date Wed, 22 Jun 2011 16:17:59 GMT
Hi,

I would like to propose Kafka to be an Apache Incubator project.  Kafka is a
distributed, high throughput, publish-subscribe system for processing large
amounts of streaming data.

Here's a link to the proposal in the Incubator wiki
http://wiki.apache.org/incubator/KafkaProposal

I've also pasted the initial contents below.

Thanks,

Jun

== Abstract ==
Kafka is a distributed publish-subscribe system for processing large amounts
of streaming data.

== Proposal ==
Kafka provides an extremely high throughput distributed publish/subscribe
messaging system.  Additionally, it supports relatively long term
persistence of messages to support a wide variety of consumers, partitioning
of the message stream across servers and consumers, and functionality for
loading data into Apache Hadoop for offline, batch processing.

== Background ==
Kafka was developed at LinkedIn to process the large amounts of events
generated by that company's website and provide a common repository for many
types of consumers to access and process those events. Kafka has been used
in production at LinkedIn scale to handle dozens of types of events
including page views, searches and social network activity. Kafka clusters
at LinkedIn currently process more than two billion events per day.

Kafka fills the gap between messaging systems such as Apache ActiveMQ, which
can handle high-volume messaging but lack persistence of those messages, and
log processing systems such as Scribe and Flume, which do not provide
adequate latency for our diverse set of consumers.  Kafka can also be
inserted into traditional log-processing systems, acting as an intermediate
step before further processing. Kafka focuses relentlessly on performance
and throughput by neither introspecting message content nor indexing
messages on the broker.  We also achieve high performance by depending on
Java's sendFile/transferTo capabilities to minimize intermediate buffer
copies and relying on the OS's pagecache to efficiently serve up message
contents to consumers.
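
As an illustration of the transferTo technique mentioned above, here is a
minimal sketch in Java. It is not Kafka's actual code: the class name
ZeroCopySketch, the method sendFile and the temporary file are hypothetical
and exist only for this example. It shows how FileChannel.transferTo can hand
a file's bytes to a target channel without the application copying them
through its own buffer. The kernel-level shortcut typically applies only when
the target is a socket or file channel; with the in-memory sink used in main
below, the JVM falls back to an ordinary copy, so that part only demonstrates
the API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not taken from the Kafka source tree.
final class ZeroCopySketch {

    /**
     * Sends the whole file at {@code source} to {@code target} using
     * FileChannel.transferTo and returns the number of bytes transferred.
     * Assumes a blocking target channel.
     */
    static long sendFile(Path source, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                long sent = in.transferTo(position, size - position, target);
                if (sent == 0) {
                    // No progress (e.g. a non-blocking target that is full);
                    // a real server would wait and retry instead of stopping.
                    break;
                }
                position += sent;
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        // A stand-in for an on-disk message log.
        Path log = Files.createTempFile("segment", ".log");
        Files.write(log, "hello kafka".getBytes(StandardCharsets.UTF_8));

        // An in-memory sink stands in for a consumer's socket.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long sent = sendFile(log, Channels.newChannel(sink));
        System.out.println(sent + " bytes transferred");

        Files.delete(log);
    }
}
```

In a broker-like setting the target would be a SocketChannel connected to a
consumer, which is where the saving in intermediate buffer copies would be
expected to show up.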

Kafka is written in Scala and depends on Apache ZooKeeper for coordination
amongst its producers, brokers and consumers.

Kafka was developed internally at LinkedIn to meet our particular use cases,
but will be useful to many organizations facing a similar need to reliably
process large amounts of streaming data.  Therefore, we would like to share
it with the ASF and begin developing a community of developers and users within
Apache.

== Rationale ==
Many organizations can benefit from a reliable stream processing system such
as Kafka.  While our use case of processing events from a very large website
like LinkedIn has driven the design of Kafka, its uses are varied and we
expect many new use cases to emerge.  Kafka provides a natural bridge
between near real-time event processing and offline batch processing and
will appeal to many users.

== Current Status ==
=== Meritocracy ===
Our intent with this incubator proposal is to start building a diverse
developer community around Kafka following the Apache meritocracy model.
Since Kafka was open sourced we have solicited contributions via the website
and presentations given to user groups and technical audiences.  We have had
positive responses to these and have received several contributions and
clients for other languages.  We plan to continue this support for new
contributors and work with those who contribute significantly to the project
to make them committers.

=== Community ===
Kafka is currently being developed by engineers within LinkedIn and is
used in production at that company. Additionally, we have active users in or
have received contributions from a diverse set of companies including
MediaSift, SocialTwist, Clearspring and Urban Airship. Recent public
presentations of Kafka and its goals garnered much interest from potential
contributors. We hope to extend our contributor base significantly and
invite all those who are interested in building high-throughput distributed
systems to participate.  We have begun receiving contributions from outside
of LinkedIn, including clients for several languages such as Ruby, PHP,
Clojure, .NET and Python.

To further this goal, we use GitHub issue tracking and branching facilities,
as well as maintaining a public mailing list via Google Groups.

=== Core Developers ===
Kafka is currently being developed by four engineers at LinkedIn: Neha
Narkhede, Jun Rao, Jakob Homan and Jay Kreps. Jun has experience within
Apache as a Cassandra committer and PMC member. Neha has been an active
contributor to several projects LinkedIn has open sourced, including Bobo,
Sensei and Zoie. Jay has experience with open source software as the
originator of Project Voldemort, as well as being active within the Hadoop
ecosystem community. Jakob is an Apache Hadoop committer and PMC member and
a previous Apache ZooKeeper contributor.

=== Alignment ===
The ASF is the natural choice to host the Kafka project as its goal of
encouraging community-driven open-source projects fits with our vision for
Kafka.  Additionally, many other projects with which we are familiar and
with which we expect Kafka to integrate, such as Apache Hadoop, Pig,
ZooKeeper and log4j, are hosted by the ASF, and we will benefit and provide
benefit by close proximity to them.

== Known Risks ==
=== Orphaned Products ===
The core developers plan to work full time on the project. There is very
little risk of Kafka being abandoned as it is a critical part of LinkedIn's
internal infrastructure and is in production use.

=== Inexperience with Open Source ===
All of the core developers have experience with open source development.
LinkedIn open sourced Kafka several months ago and has been receiving
contributions since.  Jun is an Apache Cassandra committer and PMC member.
Jay and Neha have been involved with several open source projects released
by LinkedIn.  Jakob has been actively involved with the ASF as a full-time
Hadoop committer and PMC member.

=== Homogeneous Developers ===
The current core developers are all from LinkedIn. However, we hope to
establish a developer community that includes contributors from several
corporations, and we are actively encouraging new contributors via the
mailing lists and public presentations of Kafka.

=== Reliance on Salaried Developers ===
Currently, the developers are paid to do work on Kafka. Once the project has
a community built around it, we expect to get committers, developers and
community from outside the current core developers. However, because
LinkedIn relies on Kafka internally, the reliance on salaried developers is
unlikely to change.

=== Relationships with Other Apache Products ===
Kafka is deeply integrated with Apache products. Kafka uses Apache ZooKeeper
to coordinate its state amongst the brokers, consumers, and soon, the
producers.  Kafka provides input formats to allow Hadoop MapReduce to load
data directly from Kafka.  Kafka provides an appender to allow consuming
data directly from Apache log4j.

=== An Excessive Fascination with the Apache Brand ===
While we respect the reputation of the Apache brand and have no doubts that
it will attract contributors and users, our interest is primarily to give
Kafka a solid home as an open source project following an established
development model. We have also given reasons in the Rationale and Alignment
sections.

== Documentation ==
Information about Kafka can be found at [http://sna-projects.com/kafka/].
The following links provide more information about the project:

 * Kafka roadmap and goals: [http://sna-projects.com/kafka/projects.php]
 * The GitHub site: [https://github.com/kafka-dev/kafka]
 * Kafka overview from Jay Kreps: [http://www.slideshare.net/ydn/hug-january-2011-kafka-presentation]
 * Kafka overview from Jakob Homan: [http://bit.ly/fLmoZz]
 * Kafka paper at NetDB 2011: [http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf]

== Initial Source ==
Kafka has been under development at LinkedIn since November 2009.  It was
open sourced by LinkedIn in January 2011.  It is currently hosted on GitHub
under the Apache license at [https://github.com/kafka-dev/kafka].

Kafka is mainly written in Scala with some performance testing code in Java.
Several clients have been contributed in other languages, including Ruby,
PHP, Clojure, .NET and Python.  Its source tree is entirely self-contained
and relies on the simple build tool (sbt) as its build system and dependency
resolution mechanism.

== External Dependencies ==
The dependencies all have Apache compatible licenses.

== Cryptography ==
Not applicable.

== Required Resources ==
=== Mailing Lists ===
 * kafka-private for private PMC discussions (with moderated subscriptions)
 * kafka-dev
 * kafka-commits
 * kafka-user

=== Subversion Directory ===
[https://svn.apache.org/repos/asf/incubator/kafka]

=== Issue Tracking ===
JIRA Kafka (KAFKA)

=== Other Resources ===
The existing code already has unit tests, so we would like a Hudson instance
to run them whenever a new patch is submitted. This can be added after
project creation.

== Initial Committers ==
 * Jay Kreps
 * Jun Rao
 * Neha Narkhede
 * Jakob Homan

== Affiliations ==
 * Jay Kreps (LinkedIn)
 * Jun Rao (LinkedIn)
 * Neha Narkhede (LinkedIn)
 * Jakob Homan (LinkedIn)

== Sponsors ==
=== Champion ===
Chris Douglas (Apache Member)

=== Nominated Mentors ===
 * Alan Cabrera (Apache Member)
 * Geir Magnusson, Jr. (Apache Member and Director)
 * Owen O'Malley (Apache Member)

=== Sponsoring Entity ===
We are requesting the Incubator to sponsor this project.
