incubator-general mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject [VOTE] S4 to join the Incubator
Date Tue, 20 Sep 2011 20:56:58 GMT
It's been nearly a week since the S4 proposal was submitted for
discussion.  A few questions were asked, and the proposal was clarified
in response.  Sufficient mentors have volunteered.  I thus feel we are
now ready for a vote.

The latest proposal can be found at the end of this email and at:

 http://wiki.apache.org/incubator/S4Proposal

The discussion regarding the proposal can be found at:

 http://s.apache.org/RMU

Please cast your votes:

[  ] +1 Accept S4 for incubation
[  ] +0 Indifferent to S4 incubation
[  ] -1 Reject S4 for incubation

This vote will close 72 hours from now.

Thanks,

Patrick

------------------
= S4 Proposal =

== Abstract ==

S4 (Simple Scalable Streaming System) is a general-purpose,
distributed, scalable, partially fault-tolerant, pluggable platform
that allows programmers to easily develop applications for processing
continuous, unbounded streams of data.

== Proposal ==

S4 is a software platform written in Java. Clients that send and
receive events can be written in any programming language. S4 also
includes a collection of modules called Processing Elements (or PEs
for short) that implement basic functionality and can be used by
application developers. In S4, keyed data events are routed with
affinity to Processing Elements (PEs), which consume the events and do
one or both of the following: (1) ''emit'' one or more events which
may be consumed by other PEs, (2) ''publish'' results. The
architecture resembles the Actors model, providing semantics of
encapsulation and location transparency, thus allowing applications to
be massively concurrent while exposing a simple programming interface
to application developers.
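
To make the routing idea above concrete, here is a minimal sketch in Java. It is not the S4 API: the names KeyedRoutingSketch, Event, CounterPE, Node, and Cluster are hypothetical, and hashing the key modulo the node count is only an assumed way to get key affinity. It shows one PE instance per key consuming events and emitting an event with its updated count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyedRoutingSketch {

    /** A keyed event: the key decides which PE instance consumes it. */
    static final class Event {
        final String key;
        final int value;

        Event(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical PE that keeps a running count for a single key. */
    static final class CounterPE {
        private final String key;
        private int count;

        CounterPE(String key) {
            this.key = key;
        }

        /** Consumes an event and emits a new event carrying the updated count. */
        Event process(Event e) {
            count += e.value;
            return new Event(key, count);
        }

        int count() {
            return count;
        }
    }

    /** One node: holds one PE instance per key that is routed to it. */
    static final class Node {
        final Map<String, CounterPE> pes = new HashMap<>();

        Event deliver(Event e) {
            // Create the PE for this key on first use, then let it consume the event.
            return pes.computeIfAbsent(e.key, CounterPE::new).process(e);
        }
    }

    /** A fixed-size set of nodes with hash-based key affinity (an assumption). */
    static final class Cluster {
        private final List<Node> nodes = new ArrayList<>();

        Cluster(int size) {
            for (int i = 0; i < size; i++) {
                nodes.add(new Node());
            }
        }

        /** The same key always maps to the same node index. */
        int nodeIndexFor(String key) {
            return Math.floorMod(key.hashCode(), nodes.size());
        }

        Event route(Event e) {
            return nodes.get(nodeIndexFor(e.key)).deliver(e);
        }

        int countFor(String key) {
            CounterPE pe = nodes.get(nodeIndexFor(key)).pes.get(key);
            return pe == null ? 0 : pe.count();
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster(3);
        for (String word : new String[] {"apple", "pear", "apple"}) {
            Event out = cluster.route(new Event(word, 1));
            System.out.println(out.key + " -> " + out.value);
        }
    }
}
```

Because every event with the same key lands on the same PE instance, the PE can keep its state locally without locks, which is the property the Actors comparison above relies on.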

To drive adoption and increase the number of contributors to the
project, we may need to prioritize the focus based on feedback from
the community. We believe that one of the top priorities and driving
design principles for the S4 project is to provide a simple API that
hides most of the complexity associated with distributed systems and
concurrency. The project grew out of the need to provide a flexible
platform for application developers and scientists that can be used
for quick experimentation and production.

S4 differs from existing Apache projects in a number of fundamental
ways. Flume is an Incubator project that focuses on log processing,
performing lightweight processing in a distributed fashion and
accumulating log data in a centralized repository for batch
processing. S4 instead performs all stream processing in a distributed
fashion and enables applications to form arbitrary graphs to process
streams of events. We see Flume as a complementary project. We also
expect S4 to complement Hadoop processing and in some cases to
supersede it. Kafka is another Incubator project that focuses on
processing large amounts of stream data. The design of Kafka, however,
follows the pub-sub paradigm, which focuses on delivering messages
containing arbitrary data from source processes (publishers) to
consumer processes (subscribers). Compared to S4, Kafka is an
intermediate step between data generation and processing, while S4 is
itself a platform for processing streams of events.

S4 overall addresses a need of existing applications to process
streams of events beyond moving data to a centralized repository for
batch processing. It complements the features of existing Apache
projects, such as Hadoop, Flume, and Kafka, by providing a flexible
platform for distributed event processing.

== Background ==

S4 was initially developed at Yahoo! Labs starting in 2008 to process
user feedback in the context of search advertising. The project was
licensed under the Apache License version 2.0 in October 2010. The
project documentation is currently available at http://s4.io .

== Rationale ==

Stream computing has been growing steadily over the last 20 years.
However, recently there has been an explosion in real-time data
sources including the Web, sensor networks, financial securities
analysis and trading, traffic monitoring, natural language processing
of news and social data, and much more.

Although Hadoop has evolved into a standard open source solution for
batch processing of massive data sets, there is no equivalent
community-supported open source platform for processing data streams
in real time. While various research projects have evolved into
proprietary commercial products, S4 has the potential to fill the gap.
Many projects that require a scalable stream processing architecture
currently use Hadoop by segmenting the input stream into data batches.
This solution is not efficient, results in high latency, and
introduces unnecessary complexity.

The S4 design is primarily driven by large scale applications for data
mining and machine learning in a production environment. We think that
the S4 design is surprisingly flexible and lends itself to running in
large clusters built with commodity hardware.

S4 enables application programmers to focus more on the application
and less on the infrastructure. S4 also provides a consistent,
graph-oriented programming model that, if widely adopted, will
facilitate sharing of basic components across developers.

== Initial Goals ==

The basic S4 infrastructure is complete and can be used in real-world
applications. However, many additional components need to be developed
and improved. Some areas we hope to focus on in Apache:

 * Add a reliable communication protocol option to the communication
layer for low bandwidth control messages that require guaranteed
delivery.
 * Higher-performance serialization and inter-node communication.
 * Functionality to save the state of PEs at runtime transparently and
restore it at startup.
 * Intelligent load shedding strategies.
 * Dynamic load balancing to make it possible to add and remove nodes
from the cluster without data loss.
 * Dynamic application loading and unloading.
 * Migration to a pure object-oriented design that takes advantage of
Java static typing using Generics in the framework code. (Keep it
simple for the application developer.)
 * Eliminate string identifiers and XML configuration.
 * Adopt JSR 330 (Dependency Injection for Java).
 * Add real-time query support.
 * Add a cluster management system.

Clearly this is a long list, but it sets the high-level roadmap for the project.

== Current Status ==

The project has been under development at Yahoo! since late 2008, and
it was open sourced in October 2010. Since then we have received
patches from developers, started a discussion forum, and improved the
documentation.

=== Meritocracy ===

The S4 project was initially developed at Yahoo! Labs, a
research-oriented organization that values original ideas and
individual contributions. The design evolved in a bottom up fashion,
where decisions were driven by the application and the long-term
viability and flexibility of the platform. Once the project became
open-source it continued to be managed by those who were actively
doing the work.

=== Community ===

S4 is currently in use internally at Yahoo!, and since it was released
as an open source project it has received positive feedback and
contributions from developers.

=== Core Developers ===

S4 developers span a few companies and work on a voluntary basis. We
expect to have developers from other organizations joining the team in
the next few months, especially if S4 joins the Apache Incubator
project. Being an Apache Incubator project is likely to attract the
attention of more talented developers.

One interesting aspect of the current group of developers is their
diverse backgrounds:

 * Kishore Gopalakrishna was the main developer of the communication
layer and the integration with ZooKeeper. He has been an active
contributor to Hadoop.
 * Flavio Junqueira has a background in distributed computing. He is a
committer of ZooKeeper, a ZooKeeper PMC member, and a committer of
BookKeeper.
 * Matthieu Morel has an extensive background in distributed systems.
He likes theory and loves to implement things. He has been the main
designer and implementor of S4 checkpointing.
 * Anish Nair has been the project’s main customer. With his
background in natural language processing and algorithms, he developed
the applications that drove the S4 design, including processing of
social feeds and real-time recommendation engines.
 * Leo Neumeyer has a background in signal processing and statistical
modeling but has been advocating clean simple software design
throughout his career. At Yahoo! he conceived and championed the S4
project as a solution to improve monetization in search advertising.
 * Bruce Robbins has been the main S4 developer, taking the concept
from idea to releases. Bruce's engineering experience ranges from
programming mainframe computers to assembly code.

=== Alignment ===

S4 brings stream processing capabilities that complement Hadoop's
batch processing capabilities.

== Known Risks ==

=== Orphaned Products ===

S4 has been used in production at Yahoo! and is being evaluated by
other organizations. The developers have continued to support the
project on their own time. We believe that adoption will increase
significantly as more tools and documentation become available. As the
project evolves, we may see new ideas that we may want to adopt or, if
it makes sense and is practical, we may want to merge two or more open
source projects. We believe that there is a clear need to have a well
supported open source stream processing platform and therefore, there
is low risk of the project becoming orphaned. However, we are open to
combining projects in order to have fewer projects with a more active
community. Ultimately, this will be decided by the design ideas, the
implementation quality, and the adoption.

=== Inexperience with Open Source ===

The S4 code was open sourced by Yahoo! under the Apache 2.0 license. One
committer of the S4 project, Flavio Junqueira, is intimately familiar
with the Apache model for open-source development and is experienced
with working with new contributors. Flavio is both a committer and a PMC
member for ZooKeeper. The other developers have had experience as
contributors in other open-source projects. Most of the original S4
developers continue to be committers.

=== Homogeneous Developers ===

The initial set of committers for S4 represents four different
companies: A9, LinkedIn, Quantbench, and Yahoo!. This set is diverse
enough for a starting project.

=== Reliance on Salaried Developers ===

Some committers are contributing as part of their jobs, but as we move
to a more diverse set of developers we expect a good mix of salaried
and volunteer time.

=== Relationships with Other Apache Projects ===

S4 relies on the following Apache projects:

 * BCEL (bytecode generation library)
 * commons cli (command line interface)
 * commons logging (needed by some other dependency)
 * log4j
 * commons jexl (expression processing)
 * zookeeper
 * Maven and its usual plug-ins (build time only)

Compared to existing projects, S4 complements existing functionality
in a few ways summarized below:
 * Flume: S4 processes streams in a distributed fashion and enables
applications to form arbitrary graphs of processing elements. Flume
focuses on accumulating streams of logs in a centralized repository for
batch processing.
 * Kafka: Kafka is a pub/sub messaging layer that sits between the
generation of events and their processing, while S4 itself forwards
events and processes them in a stream fashion.
 * Hadoop: Hadoop focuses on batch processing of large data sets,
while S4 is a platform for stream processing of events. We would like
to implement extensions that enable processing in both platforms with
the same code.

=== An Excessive Fascination with the Apache Brand ===

The project has already received a significant amount of attention and
so far has been associated with Yahoo!. We would like, however, to
foster the development of a community around S4 that evolves
independently of the interests of a single company. Given the reliance
of S4 on some Apache projects and the principles promoted by the
foundation, we find it a suitable home for the project.

== Documentation ==

 * S4 Website: http://s4.io
 * S4 documentation: http://docs.s4.io/
 * S4 Forum: http://groups.google.com/group/s4-project/topics
 * S4 Mailing list (with archives): http://groups.google.com/group/s4-project

== Source and Intellectual Property Submission Plan ==

The S4 source code is already licensed under Apache Software License
2.0. The source code is available at https://github.com/s4


== External Dependencies ==

 * asm (3-clause BSD license)
 * json (json.org's own license
http://www.crockford.com/JSON/license.html which is acceptable as per
Apache FAQ: http://www.apache.org/legal/resolved.html#json)
 * kryo (4-clause BSD license)
 * spring framework (Apache license - v 2)
 * codehaus jackson (Apache license)
 * junit (Common Public License - v 1.0)

== Cryptography ==
None

== Required Resources ==

=== Mailing lists ===
 * s4-dev
 * s4-user
 * s4-private (with moderated subscriptions)
 * s4-commit

=== Subversion Directory ===

https://svn.apache.org/repos/asf/incubator/s4

=== Issue Tracking ===

JIRA S4 (S4)

== Initial Committers ==
 * Kishore Gopalakrishna (kg at s4 dot io)
 * Flavio Junqueira (fpj at s4 dot io)
 * Matthieu Morel (mm at s4 dot io)
 * Anish Nair (an at s4 dot com)
 * Leo Neumeyer (leo at s4 dot io)
 * Bruce Robbins (br at s4 dot io)

== Affiliations ==
 * Kishore Gopalakrishna, LinkedIn
 * Flavio Junqueira, Yahoo!
 * Matthieu Morel, Yahoo!
 * Anish Nair, A9
 * Leo Neumeyer, Quantbench
 * Bruce Robbins, Yahoo!

== Sponsors ==

=== Champion ===

 * Patrick Hunt

=== Nominated Mentors ===

 * Patrick Hunt
 * Owen O’Malley
 * Arun Murthy

=== Sponsoring Entity ===

 * Apache Incubator PMC

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

