incubator-general mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: [VOTE] S4 to join the Incubator
Date Wed, 21 Sep 2011 03:40:56 GMT
+1 (non-binding)

On Tue, Sep 20, 2011 at 4:56 PM, Patrick Hunt <phunt@apache.org> wrote:
> It's been nearly a week since the S4 proposal was submitted for
> discussion.  A few questions were asked, and the proposal was clarified
> in response.  Sufficient mentors have volunteered.  I thus feel we are
> now ready for a vote.
>
> The latest proposal can be found at the end of this email and at:
>
>  http://wiki.apache.org/incubator/S4Proposal
>
> The discussion regarding the proposal can be found at:
>
>  http://s.apache.org/RMU
>
> Please cast your votes:
>
> [  ] +1 Accept S4 for incubation
> [  ] +0 Indifferent to S4 incubation
> [  ] -1 Reject S4 for incubation
>
> This vote will close 72 hours from now.
>
> Thanks,
>
> Patrick
>
> ------------------
> = S4 Proposal =
>
> == Abstract ==
>
> S4 (Simple Scalable Streaming System) is a general-purpose,
> distributed, scalable, partially fault-tolerant, pluggable platform
> that allows programmers to easily develop applications for processing
> continuous, unbounded streams of data.
>
> == Proposal ==
>
> S4 is a software platform written in Java. Clients that send and
> receive events can be written in any programming language. S4 also
> includes a collection of modules called Processing Elements (or PEs
> for short) that implement basic functionality and can be used by
> application developers. In S4, keyed data events are routed with
> affinity to Processing Elements (PEs), which consume the events and do
> one or both of the following: (1) ''emit'' one or more events which
> may be consumed by other PEs, (2) ''publish'' results. The
> architecture resembles the Actors model, providing semantics of
> encapsulation and location transparency, thus allowing applications to
> be massively concurrent while exposing a simple programming  interface
> to application developers.
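
As an illustration of the routing described above, here is a minimal single-process sketch in Python. It is not the S4 API: the names `Event`, `CounterPE`, `SinkPE`, and `Router` are hypothetical, and a real deployment would place PE instances on cluster nodes rather than in a local dictionary. The sketch only shows that events with the same key always reach the same PE instance, and that a PE can either emit new events or publish results.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    stream: str    # stream the event travels on
    key: str       # value used to select the PE instance (affinity)
    value: object  # payload


class CounterPE:
    """Hypothetical PE: one instance per key, counts events for that key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += 1
        # Emit a keyed event that a downstream PE may consume.
        return [Event("counts", "all", (self.key, self.count))]


class SinkPE:
    """Hypothetical PE that publishes results instead of emitting events."""

    def __init__(self, key):
        self.key = key
        self.published = []

    def process(self, event):
        self.published.append(event.value)
        return []


class Router:
    """Routes each event to the PE instance owning (stream, key)."""

    def __init__(self, pe_types):
        self.pe_types = pe_types  # stream name -> PE class
        self.instances = {}       # (stream, key) -> PE instance

    def dispatch(self, event):
        queue = deque([event])
        while queue:
            ev = queue.popleft()
            pe_cls = self.pe_types.get(ev.stream)
            if pe_cls is None:
                continue  # no PE subscribed to this stream
            slot = (ev.stream, ev.key)
            if slot not in self.instances:
                # A PE instance is created lazily the first time its key is seen.
                self.instances[slot] = pe_cls(ev.key)
            queue.extend(self.instances[slot].process(ev))


router = Router({"words": CounterPE, "counts": SinkPE})
for word in ["a", "b", "a"]:
    router.dispatch(Event("words", word, None))
print(router.instances[("counts", "all")].published)
# prints [('a', 1), ('b', 1), ('a', 2)]
```

Both events keyed "a" land on the same `CounterPE` instance, which is why its count reaches 2. In a cluster, the lookup by `(stream, key)` would presumably be replaced by a mapping from key to node, which is an assumption of this sketch and not something the proposal specifies.
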
>
> To drive adoption and increase the number of contributors to the
> project, we may need to prioritize the focus based on feedback from
> the community. We believe that one of the top priorities and driving
> design principles for the S4 project is to provide a simple API that
> hides most of the complexity associated with distributed systems and
> concurrency. The project grew out of the need to provide a flexible
> platform for application developers and scientists that can be used
> for quick experimentation and production.
>
> S4 differs from existing Apache projects in a number of fundamental
> ways. Flume is an Incubator project that focuses on log processing,
> performing lightweight processing in a distributed fashion and
> accumulating log data in a centralized repository for batch
> processing. S4 instead performs all stream processing in a distributed
> fashion and enables applications to form arbitrary graphs to process
> streams of events. We see Flume as a complementary project. We also
> expect S4 to complement Hadoop processing and in some cases to
> supersede it. Kafka is another Incubator project that focuses on
> processing large amounts of stream data. The design of Kafka, however,
> follows the pub-sub paradigm, which focuses on delivering messages
> containing arbitrary data from source processes (publishers) to
> consumer processes (subscribers). Compared to S4, Kafka is an
> intermediate step between data generation and processing, while S4 is
> itself a platform for processing streams of events.
>
> S4 overall addresses a need of existing applications to process
> streams of events beyond moving data to a centralized repository for
> batch processing. It complements the features of existing Apache
> projects, such as Hadoop, Flume, and Kafka, by providing a flexible
> platform for distributed event processing.
>
> == Background ==
>
> S4 was initially developed at Yahoo! Labs starting in 2008 to process
> user feedback in the context of search advertising. The project was
> licensed under the Apache License version 2.0 in October 2010. The
> project documentation is currently available at http://s4.io .
>
> == Rationale ==
>
> Stream computing has been growing steadily over the last 20 years.
> However, recently there has been an explosion in real-time data
> sources including the Web, sensor networks, financial securities
> analysis and trading, traffic monitoring, natural language processing
> of news and social data, and much more.
>
> Although Hadoop has evolved into a standard open source solution for
> batch processing of massive data sets, there is no equivalent
> community-supported open source platform for processing data streams
> in real time. While various research projects have evolved into
> proprietary commercial products, S4 has the potential to fill the gap.
> Many projects that require a scalable stream processing architecture
> currently use Hadoop by segmenting the input stream into data batches.
> This solution is not efficient, results in high latency, and
> introduces unnecessary complexity.
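
To make the latency point above concrete, here is a small Python sketch. It rests on simplifying assumptions that are not from the proposal: processing itself takes no time, and a batch is handled at the moment its last event arrives. The function names are hypothetical.

```python
def batch_latencies(arrival_times, batch_size):
    """Delay per event when results appear only once its batch is complete."""
    latencies = []
    for start in range(0, len(arrival_times), batch_size):
        batch = arrival_times[start:start + batch_size]
        closed_at = batch[-1]  # the batch is processed when its last event arrives
        latencies.extend(closed_at - t for t in batch)
    return latencies


def streaming_latencies(arrival_times):
    """Delay per event when each event is processed on arrival."""
    return [0 for _ in arrival_times]


arrivals = [0, 1, 2, 3, 4, 5]  # one event per time unit
print(batch_latencies(arrivals, 3))   # prints [2, 1, 0, 2, 1, 0]
print(streaming_latencies(arrivals))  # prints [0, 0, 0, 0, 0, 0]
```

Under these assumptions the first event of every batch waits for the whole batch to fill, so the added delay grows with the batch size, while per-event processing adds none.
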
>
> The S4 design is primarily driven by large scale applications for data
> mining and machine learning in a production environment. We think that
> the S4 design is surprisingly flexible and lends itself to run in
> large clusters built with commodity hardware.
>
> S4 enables application programmers to focus more on the application
> and less on the infrastructure. S4 also provides a consistent graph
> oriented programming model that, if widely adopted, will facilitate
> sharing of basic components across developers.
>
> == Initial Goals ==
>
> The basic S4 infrastructure is complete and can be used in real-world
> applications. However, many additional components need to be developed
> and improved. Some areas we hope to focus on in Apache:
>
>  * Add a reliable communication protocol option to the communication
> layer for low bandwidth control messages that require guaranteed
> delivery.
>  * Higher-performance serialization and inter-node communication.
>  * Functionality to save the state of PEs at runtime transparently and
> restore it at startup.
>  * Intelligent load shedding strategies.
>  * Dynamic load balancing to make it possible to add and remove nodes
> from the cluster without data loss.
>  * Dynamic application loading and unloading.
>  * Migration to a pure object-oriented design that takes advantage of
> Java static typing using Generics in the framework code. (Keep it
> simple for the application developer.)
>  * Eliminate string identifiers and XML configuration.
>  * Adopt JSR 330 (Dependency Injection for Java).
>  * Add real-time query support.
>  * Add a cluster management system.
>
> Clearly this is a long list, but it sets the high-level roadmap for the project.
>
> == Current Status ==
>
> The project has been under development at Yahoo! since late 2008, and
> it was open sourced in October 2010. Since then we have received
> patches from developers, started a discussion forum, and improved the
> documentation.
>
> === Meritocracy ===
>
> The S4 project was initially developed at Yahoo! Labs, a
> research-oriented organization that values original ideas and
> individual contributions. The design evolved in a bottom up fashion,
> where decisions were driven by the application and the long-term
> viability and flexibility of the platform. Once the project became
> open-source it continued to be managed by those who were actively
> doing the work.
>
> === Community ===
>
> S4 is currently in use internally at Yahoo!, and since it was released
> as an open source project it has received positive feedback and
> contributions from developers.
>
> === Core Developers ===
>
> S4 developers span a few companies and work on a voluntary basis. We
> expect to have developers from other organizations joining the team in
> the next few months, especially if S4 joins the Apache Incubator
> project. Being an Apache Incubator project is likely to attract the
> attention of more talented developers.
>
> One interesting aspect of the current group of developers is the
> diverse background:
>
>  * Kishore Gopalakrishna was the main developer of the communication
> layer and the integration with ZooKeeper. He has been an active
> contributor to Hadoop.
>  * Flavio Junqueira has a background in distributed computing. He is a
> committer of ZooKeeper, a ZooKeeper PMC member, and a committer of
> BookKeeper.
>  * Matthieu Morel has an extensive background in distributed systems;
> he likes theory and loves to implement things. He has been the main
> designer and implementor of S4 checkpointing.
>  * Anish Nair has been the project’s main customer. With his
> background in natural language processing and algorithms, he developed
> the applications that drove the S4 design, including processing of
> social feeds and real-time recommendation engines.
>  * Leo Neumeyer has a background in signal processing and statistical
> modeling but has been advocating clean simple software design
> throughout his career. At Yahoo! he conceived and championed the S4
> project as a solution to improve monetization in search advertising.
>  * Bruce Robbins has been the main S4 developer, taking the concept
> from idea to releases. Bruce's engineering experience ranges from
> programming mainframe computers to assembly code.
>
> === Alignment ===
>
> S4 brings stream processing capabilities that complement Hadoop's
> batch processing capabilities.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> S4 has been used in production at Yahoo! and is being evaluated by
> other organizations. The developers have continued to support the
> project on their own time. We believe that adoption will increase
> significantly as more tools and documentation become available. As the
> project evolves, we may see new ideas that we may want to adopt or, if
> it makes sense and is practical, we may want to merge two or more open
> source projects. We believe that there is a clear need for a
> well-supported open source stream processing platform, and therefore
> there is a low risk of the project becoming orphaned. However, we are open to
> combining projects in order to have fewer projects with a more active
> community. Ultimately, this will be decided by the design ideas, the
> implementation quality, and the adoption.
>
> === Inexperience with Open Source ===
>
> The S4 code was open sourced by Yahoo! under the Apache 2.0 license. One
> committer of the S4 project, Flavio Junqueira, is intimately familiar
> with the Apache model for open-source development and is experienced
> with working with new contributors.  Flavio is both a committer and a PMC
> member for ZooKeeper. The other developers have had experience as
> contributors in other open-source projects. Most of the original S4
> developers continue to be committers.
>
> === Homogeneous Developers ===
>
> The initial set of committers for S4 represents four different
> companies: A9, LinkedIn, Quantbench, and Yahoo!. This set is diverse
> enough for a starting project.
>
> === Reliance on Salaried Developers ===
>
> Some committers are contributing as part of their jobs, but as we move
> to a more diverse set of developers we expect a good mix of salaried
> and volunteer time.
>
> === Relationships with Other Apache Projects ===
>
> S4 relies on the following Apache projects:
>
>  * BCEL (bytecode generation library)
>  * commons cli (command line interface)
>  * commons logging (needed by some other dependency)
>  * log4j
>  * commons jexl (expression processing)
>  * zookeeper
>  * Maven and its usual plug-ins (build time only)
>
> Compared to existing projects, S4 complements existing functionality
> in a few ways summarized below:
>  * Flume: S4 processes streams in a distributed fashion and enables
> applications to form arbitrary graphs of processing elements. Flume
> focuses on accumulating streams of logs in a centralized repository for
> batch processing;
>  * Kafka: Kafka is a pub/sub messaging layer that sits between event
> generation and event processing, while S4 itself forwards events and
> processes them in a streaming fashion.
>  * Hadoop: Hadoop focuses on batch processing of large data sets,
> while S4 is a platform for stream processing of events. We would like
> to implement extensions that enable processing in both platforms with
> the same code.
>
> === An Excessive Fascination with the Apache Brand ===
>
> The project has already received a significant amount of attention and
> so far has been associated with Yahoo!. We would like, however, to
> foster the development of a community around S4 that evolves
> independently of the interests of a single company. Given the reliance
> of S4 on some Apache projects and the principles promoted by the
> foundation, we find it a suitable home for the project.
>
> == Documentation ==
>
>  * S4 Website: http://s4.io
>  * S4 documentation: http://docs.s4.io/
>  * S4 Forum: http://groups.google.com/group/s4-project/topics
>  * S4 Mailing list (with archives): http://groups.google.com/group/s4-project
>
> == Source and Intellectual Property Submission Plan ==
>
> The S4 source code is already licensed under Apache Software License
> 2.0. The source code is available at https://github.com/s4
>
>
> == External Dependencies ==
>
>  * asm (3-clause BSD license)
>  * json (json.org's own license
> http://www.crockford.com/JSON/license.html which is acceptable as per
> Apache FAQ: http://www.apache.org/legal/resolved.html#json)
>  * kryo (4-clause BSD license)
>  * spring framework (Apache license - v 2)
>  * codehaus jackson (Apache license)
>  * junit (Common Public License - v 1.0)
>
> == Cryptography ==
> None
>
> == Required Resources ==
>
> === Mailing lists ===
>  * s4-dev
>  * s4-user
>  * s4-private (with moderated subscriptions)
>  * s4-commit
>
> === Subversion Directory ===
>
> https://svn.apache.org/repos/asf/incubator/s4
>
> === Issue Tracking ===
>
> JIRA S4 (S4)
>
> == Initial Committers ==
>  * Kishore Gopalakrishna (kg at s4 dot io)
>  * Flavio Junqueira (fpj at s4 dot io)
>  * Matthieu Morel (mm at s4 dot io)
>  * Anish Nair (an at s4 dot com)
>  * Leo Neumeyer (leo at s4 dot io)
>  * Bruce Robbins (br at s4 dot io)
>
> == Affiliations ==
>  * Kishore Gopalakrishna, LinkedIn
>  * Flavio Junqueira, Yahoo!
>  * Matthieu Morel, Yahoo!
>  * Anish Nair, A9
>  * Leo Neumeyer, Quantbench
>  * Bruce Robbins, Yahoo!
>
> == Sponsors ==
>
> === Champion ===
>
>  * Patrick Hunt
>
> === Nominated Mentors ===
>
>  * Patrick Hunt
>  * Owen O’Malley
>  * Arun Murthy
>
> === Sponsoring Entity ===
>
>  * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


