incubator-general mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: [VOTE] S4 to join the Incubator
Date Wed, 21 Sep 2011 01:38:34 GMT
+1 (binding)

Arun

On Sep 20, 2011, at 1:56 PM, Patrick Hunt wrote:

> It's been nearly a week since the S4 proposal was submitted for
> discussion.  A few questions were asked, and the proposal was clarified
> in response.  Sufficient mentors have volunteered.  I thus feel we are
> now ready for a vote.
> 
> The latest proposal can be found at the end of this email and at:
> 
> http://wiki.apache.org/incubator/S4Proposal
> 
> The discussion regarding the proposal can be found at:
> 
> http://s.apache.org/RMU
> 
> Please cast your votes:
> 
> [  ] +1 Accept S4 for incubation
> [  ] +0 Indifferent to S4 incubation
> [  ] -1 Reject S4 for incubation
> 
> This vote will close 72 hours from now.
> 
> Thanks,
> 
> Patrick
> 
> ------------------
> = S4 Proposal =
> 
> == Abstract ==
> 
> S4 (Simple Scalable Streaming System) is a general-purpose,
> distributed, scalable, partially fault-tolerant, pluggable platform
> that allows programmers to easily develop applications for processing
> continuous, unbounded streams of data.
> 
> == Proposal ==
> 
> S4 is a software platform written in Java. Clients that send and
> receive events can be written in any programming language. S4 also
> includes a collection of modules called Processing Elements (or PEs
> for short) that implement basic functionality and can be used by
> application developers. In S4, keyed data events are routed with
> affinity to Processing Elements (PEs), which consume the events and do
> one or both of the following: (1) ''emit'' one or more events which
> may be consumed by other PEs, (2) ''publish'' results. The
> architecture resembles the Actors model, providing semantics of
> encapsulation and location transparency, thus allowing applications to
> be massively concurrent while exposing a simple programming  interface
> to application developers.
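
The routing described in the paragraph above can be illustrated with a minimal sketch. It is not the S4 API: the names Event, CounterPE, Router, partitionFor, and countWords are hypothetical, and hash partitioning is only an assumed way to get key affinity across nodes. It shows events with the same key always reaching the same PE instance, which consumes each event and emits a derived one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyedRoutingSketch {

    /** A keyed data event; the key decides which PE instance receives it. */
    static final class Event {
        final String key;
        final String value;

        Event(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical PE that counts the events seen for the one key it owns. */
    static final class CounterPE {
        private final String key;
        private int count;

        CounterPE(String key) {
            this.key = key;
        }

        /** Consume one event and emit a derived event carrying the running count. */
        Event process(Event event) {
            count++;
            return new Event(key, Integer.toString(count));
        }
    }

    /** Routes with affinity: the same key always maps to the same PE instance. */
    static final class Router {
        private final Map<String, CounterPE> instances = new HashMap<>();
        private final List<Event> emitted = new ArrayList<>();

        void dispatch(Event event) {
            // Create the PE for this key on first use, then reuse it.
            CounterPE pe = instances.computeIfAbsent(event.key, CounterPE::new);
            emitted.add(pe.process(event));
        }

        List<Event> emitted() {
            return emitted;
        }
    }

    /** Assumed scheme for picking a node: hash the key onto a fixed number of partitions. */
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Feeds each word through the router and returns the latest count emitted per key. */
    static Map<String, Integer> countWords(List<String> words) {
        Router router = new Router();
        for (String word : words) {
            router.dispatch(new Event(word, "1"));
        }
        Map<String, Integer> latest = new TreeMap<>();
        for (Event e : router.emitted()) {
            latest.put(e.key, Integer.parseInt(e.value));
        }
        return latest;
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("s4", "kafka", "s4")));
    }
}
```

Keeping one PE instance per key means each instance holds only its own state, so no locking is needed inside the PE in this sketch; a real platform would also have to handle serialization, node failure, and rebalancing.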
> 
> To drive adoption and increase the number of contributors to the
> project, we may need to prioritize the focus based on feedback from
> the community. We believe that one of the top priorities and driving
> design principles for the S4 project is to provide a simple API that
> hides most of the complexity associated with distributed systems and
> concurrency. The project grew out of the need to provide a flexible
> platform for application developers and scientists that can be used
> for quick experimentation and production.
> 
> S4 differs from existing Apache projects in a number of fundamental
> ways. Flume is an Incubator project that focuses on log processing,
> performing lightweight processing in a distributed fashion and
> accumulating log data in a centralized repository for batch
> processing. S4 instead performs all stream processing in a distributed
> fashion and enables applications to form arbitrary graphs to process
> streams of events. We see Flume as a complementary project. We also
> expect S4 to complement Hadoop processing and in some cases to
> supersede it. Kafka is another Incubator project that focuses on
> processing large amounts of stream data. The design of Kafka, however,
> follows the pub-sub paradigm, which focuses on delivering messages
> containing arbitrary data from source processes (publishers) to
> consumer processes (subscribers). Compared to S4, Kafka is an
> intermediate step between data generation and processing, while S4 is
> itself a platform for processing streams of events.
> 
> S4 overall addresses a need of existing applications to process
> streams of events beyond moving data to a centralized repository for
> batch processing. It complements the features of existing Apache
> projects, such as Hadoop, Flume, and Kafka, by providing a flexible
> platform for distributed event processing.
> 
> == Background ==
> 
> S4 was initially developed at Yahoo! Labs starting in 2008 to process
> user feedback in the context of search advertising. The project was
> licensed under the Apache License version 2.0 in October 2010. The
> project documentation is currently available at http://s4.io .
> 
> == Rationale ==
> 
> Stream computing has been growing steadily over the last 20 years.
> However, recently there has been an explosion in real-time data
> sources including the Web, sensor networks, financial securities
> analysis and trading, traffic monitoring, natural language processing
> of news and social data, and much more.
> 
> While Hadoop has evolved into a standard open source solution for batch
> processing of massive data sets, there is no equivalent community
> supported open source platform for processing data streams in
> real-time. While various research projects have evolved into
> proprietary commercial products, S4 has the potential to fill the gap.
> Many projects that require a scalable stream processing architecture
> currently use Hadoop by segmenting the input stream into data batches.
> This solution is not efficient, results in high latency, and
> introduces unnecessary complexity.
> 
> The S4 design is primarily driven by large scale applications for data
> mining and machine learning in a production environment. We think that
> the S4 design is surprisingly flexible and lends itself to run in
> large clusters built with commodity hardware.
> 
> S4 enables application programmers to focus more on the application
> and less on the infrastructure. S4 also provides a consistent graph
> oriented programming model that, if widely adopted, will facilitate
> sharing of basic components across developers.
> 
> == Initial Goals ==
> 
> The basic S4 infrastructure is complete and can be used in real-world
> applications. However, many additional components need to be developed
> and improved. Some areas we hope to focus on in Apache:
> 
> * Add a reliable communication protocol option to the communication
> layer for low bandwidth control messages that require guaranteed
> delivery.
> * Higher-performance serialization and inter-node communication.
> * Functionality to save the state of PEs at runtime transparently and
> restore it at startup.
> * Intelligent load shedding strategies.
> * Dynamic load balancing to make it possible to add and remove nodes
> from the cluster without data loss.
> * Dynamic application loading and unloading.
> * Migration to a pure object-oriented design that takes advantage of
> Java static typing using Generics in the framework code. (Keep it
> simple for the application developer.)
> * Eliminate string identifiers and XML configuration.
> * Adopt JSR 330 (Dependency Injection for Java).
> * Add real-time query support.
> * Add a cluster management system.
> 
> Clearly this is a long list, but it sets the high-level roadmap for the project.
> 
> == Current Status ==
> 
> The project has been under development at Yahoo! since late 2008, and
> it was open sourced in October 2010. Since then we have received
> patches from developers, started a discussion forum, and improved the
> documentation.
> 
> === Meritocracy ===
> 
> The S4 project was initially developed at Yahoo! Labs, a
> research-oriented organization that values original ideas and
> individual contributions. The design evolved in a bottom up fashion,
> where decisions were driven by the application and the long-term
> viability and flexibility of the platform. Once the project became
> open-source it continued to be managed by those who were actively
> doing the work.
> 
> === Community ===
> 
> S4 is currently in use internally at Yahoo!, and since it was released
> as an open source project it has received positive feedback and
> contributions from developers.
> 
> === Core Developers ===
> 
> S4 developers span a few companies and work on a voluntary basis. We
> expect to have developers from other organizations joining the team in
> the next few months, especially if S4 joins the Apache Incubator
> project. Being an Apache Incubator project is likely to attract the
> attention of more talented developers.
> 
> One interesting aspect of the current group of developers is the
> diverse background:
> 
> * Kishore Gopalakrishna was the main developer of the communication
> layer and the integration with Zookeeper. He has been an active
> contributor to Hadoop.
> * Flavio Junqueira has a background in distributed computing. He is a
> committer of ZooKeeper, a ZooKeeper PMC member, and a committer of
> BookKeeper.
> * Matthieu Morel has an extensive background in distributed systems. He
> likes theory and loves to implement things. He has been the main
> designer and implementor of S4 checkpointing.
> * Anish Nair has been the project’s main customer. With his background
> in natural language processing and algorithms he developed the
> applications that drove the S4 design including processing of social
> feeds and real-time recommendation engines.
> * Leo Neumeyer has a background in signal processing and statistical
> modeling but has been advocating clean simple software design
> throughout his career. At Yahoo! he conceived and championed the S4
> project as a solution to improve monetization in search advertising.
> * Bruce Robbins has been the main S4 developer, taking the concept
> from idea to releases. Bruce's engineering experience ranges from
> programming Mainframe computers to assembly code.
> 
> === Alignment ===
> 
> S4 brings stream processing capabilities that complement Hadoop's
> batch processing capabilities.
> 
> == Known Risks ==
> 
> === Orphaned Products ===
> 
> S4 has been used in production at Yahoo! and is being evaluated by
> other organizations. The developers have continued to support the
> project on their own time. We believe that adoption will increase
> significantly as more tools and documentation become available. As the
> project evolves, we may see new ideas that we may want to adopt or, if
> it makes sense and is practical, we may want to merge two or more open
> source projects. We believe that there is a clear need to have a well
> supported open source stream processing platform and therefore, there
> is low risk of the project becoming orphaned. However, we are open to
> combining projects in order to have fewer projects with a more active
> community. Ultimately, this will be decided by the design ideas, the
> implementation quality, and the adoption.
> 
> === Inexperience with Open Source ===
> 
> The S4 code was open sourced by Yahoo! under the Apache 2.0 license. One
> committer of the S4 project, Flavio Junqueira, is intimately familiar
> with the Apache model for open-source development and is experienced
> with working with new contributors.  Flavio is both a committer and a PMC
> member for ZooKeeper. The other developers have had experience as
> contributors in other open-source projects. Most of the original S4
> developers continue to be committers.
> 
> === Homogeneous Developers ===
> 
> The initial set of committers for S4 represents four different
> companies: A9, Linkedin, Quantbench, and Yahoo!. This set is diverse
> enough for a starting project.
> 
> === Reliance on Salaried Developers ===
> 
> Some committers are contributing as part of their jobs, but as we move
> to a more diverse set of developers we expect a good mix of salaried
> and volunteer time.
> 
> === Relationships with Other Apache Projects ===
> 
> S4 relies on the following Apache projects:
> 
> * BCEL (bytecode generation library)
> * commons cli (command line interface)
> * commons logging (needed by some other dependency)
> * log4j
> * commons jexl (expression processing)
> * zookeeper
> * Maven and its usual plug-ins (build time only)
> 
> Compared to existing projects, S4 complements existing functionality
> in a few ways summarized below:
> * Flume: S4 processes streams in a distributed fashion and enables
> applications to form arbitrary graphs of processing elements. Flume
> focuses on accumulating streams of logs in a centralized repository for
> batch processing.
> * Kafka: Kafka is a pub/sub messaging layer that sits between the
> generation of events and their processing, while S4 itself forwards
> events and processes them in a stream fashion.
> * Hadoop: Hadoop focuses on batch processing of large data sets,
> while S4 is a platform for stream processing of events. We would like
> to implement extensions that enable processing in both platforms with
> the same code.
> 
> === An Excessive Fascination with the Apache Brand ===
> 
> The project has already received a significant amount of attention and
> so far has been associated with Yahoo!. We would like, however, to
> foster the development of a community around S4 that evolves
> independently of the interests of a single company. Given the reliance
> of S4 on some Apache projects and the principles promoted by the
> foundation, we find it a suitable home for the project.
> 
> == Documentation ==
> 
> * S4 Website: http://s4.io
> * S4 documentation: http://docs.s4.io/
> * S4 Forum: http://groups.google.com/group/s4-project/topics
> * S4 Mailing list (with archives): http://groups.google.com/group/s4-project
> 
> == Source and Intellectual Property Submission Plan ==
> 
> The S4 source code is already licensed under Apache Software License
> 2.0. The source code is available at https://github.com/s4
> 
> 
> == External Dependencies ==
> 
> * asm (3-clause BSD license)
> * json (json.org's own license
> http://www.crockford.com/JSON/license.html which is acceptable as per
> Apache FAQ: http://www.apache.org/legal/resolved.html#json)
> * kryo (4-clause BSD license)
> * spring framework (Apache license - v 2)
> * codehaus jackson (Apache license)
> * junit (Common Public License - v 1.0)
> 
> == Cryptography ==
> None
> 
> == Required Resources ==
> 
> === Mailing lists ===
> * s4-dev
> * s4-user
> * s4-private (with moderated subscriptions)
> * s4-commit
> 
> === Subversion Directory ===
> 
> https://svn.apache.org/repos/asf/incubator/s4
> 
> === Issue Tracking ===
> 
> JIRA S4 (S4)
> 
> == Initial Committers ==
> * Kishore Gopalakrishna (kg at s4 dot io)
> * Flavio Junqueira (fpj at s4 dot io)
> * Matthieu Morel (mm at s4 dot io)
> * Anish Nair (an at s4 dot com)
> * Leo Neumeyer (leo at s4 dot io)
> * Bruce Robbins (br at s4 dot io)
> 
> == Affiliations ==
> * Kishore Gopalakrishna, Linkedin
> * Flavio Junqueira, Yahoo!
> * Matthieu Morel, Yahoo!
> * Anish Nair, A9
> * Leo Neumeyer, Quantbench
> * Bruce Robbins, Yahoo!
> 
> == Sponsors ==
> 
> === Champion ===
> 
> * Patrick Hunt
> 
> === Nominated Mentors ===
> 
> * Patrick Hunt
> * Owen O’Malley
> * Arun Murthy
> 
> === Sponsoring Entity ===
> 
> * Apache Incubator PMC
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 



