incubator-general mailing list archives

From Jean-Baptiste Onofré ...@nanthrax.net>
Subject Re: [VOTE] S4 to join the Incubator
Date Wed, 21 Sep 2011 09:14:40 GMT
+1 (binding)

Regards
JB

On 09/21/2011 11:04 AM, Olivier Lamy wrote:
> +1 (binding)
>
> 2011/9/20 Patrick Hunt<phunt@apache.org>:
>> It's been nearly a week since the S4 proposal was submitted for
>> discussion.  A few questions were asked, and the proposal was clarified
>> in response.  Sufficient mentors have volunteered.  I thus feel we are
>> now ready for a vote.
>>
>> The latest proposal can be found at the end of this email and at:
>>
>>   http://wiki.apache.org/incubator/S4Proposal
>>
>> The discussion regarding the proposal can be found at:
>>
>>   http://s.apache.org/RMU
>>
>> Please cast your votes:
>>
>> [  ] +1 Accept S4 for incubation
>> [  ] +0 Indifferent to S4 incubation
>> [  ] -1 Reject S4 for incubation
>>
>> This vote will close 72 hours from now.
>>
>> Thanks,
>>
>> Patrick
>>
>> ------------------
>> = S4 Proposal =
>>
>> == Abstract ==
>>
>> S4 (Simple Scalable Streaming System) is a general-purpose,
>> distributed, scalable, partially fault-tolerant, pluggable platform
>> that allows programmers to easily develop applications for processing
>> continuous, unbounded streams of data.
>>
>> == Proposal ==
>>
>> S4 is a software platform written in Java. Clients that send and
>> receive events can be written in any programming language. S4 also
>> includes a collection of modules called Processing Elements (or PEs
>> for short) that implement basic functionality and can be used by
>> application developers. In S4, keyed data events are routed with
>> affinity to Processing Elements (PEs), which consume the events and do
>> one or both of the following: (1) ''emit'' one or more events which
>> may be consumed by other PEs, (2) ''publish'' results. The
>> architecture resembles the Actors model, providing semantics of
>> encapsulation and location transparency, thus allowing applications to
>> be massively concurrent while exposing a simple programming interface
>> to application developers.
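
To make the routing described above concrete, here is a minimal sketch of
keyed events being routed with affinity to PEs. It is written in Python for
brevity and is not the S4 API: the names Event, SplitPE, CountPE, Cluster
and node_for, the word-count scenario, and the CRC32-modulo placement rule
are all assumptions chosen for illustration. The nodes are plain in-process
dictionaries standing in for cluster machines.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    stream: str    # which stream the event belongs to
    key: str       # routing key, e.g. a sentence id or a word
    value: object  # payload


class SplitPE:
    """Hypothetical PE that emits one event per word of a sentence."""

    def __init__(self, key):
        self.key = key

    def process(self, event, emit, publish):
        for word in event.value.split():
            # (1) emit events that other PEs may consume
            emit(Event("words", word, 1))


class CountPE:
    """Hypothetical PE holding a running count for a single key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event, emit, publish):
        self.count += event.value
        # (2) publish a result
        publish(self.key, self.count)


class Cluster:
    """Toy stand-in for a cluster: each node maps (stream, key) to a PE."""

    def __init__(self, num_nodes, pe_types):
        self.nodes = [{} for _ in range(num_nodes)]
        self.pe_types = pe_types  # stream name -> PE class
        self.results = {}

    def node_for(self, key):
        # Assumed placement rule: a stable hash of the key picks the node,
        # so equal keys always land on the same node (affinity).
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def publish(self, key, value):
        self.results[key] = value

    def emit(self, event):
        node = self.nodes[self.node_for(event.key)]
        slot = (event.stream, event.key)
        if slot not in node:
            # Create the PE instance for this key on first use.
            node[slot] = self.pe_types[event.stream](event.key)
        node[slot].process(event, self.emit, self.publish)


cluster = Cluster(num_nodes=3, pe_types={"sentences": SplitPE, "words": CountPE})
cluster.emit(Event("sentences", "s1", "to be or not to be"))
cluster.emit(Event("sentences", "s2", "be quick"))
print(cluster.results)
```

Because placement depends only on the key, every event for a given word
reaches the same CountPE instance, so each PE can keep its state private.
In this sketch delivery is a synchronous function call; a real deployment
would send events between nodes over the network.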
>>
>> To drive adoption and increase the number of contributors to the
>> project, we may need to prioritize the focus based on feedback from
>> the community. We believe that a top priority and driving design
>> principle for the S4 project is to provide a simple API that
>> hides most of the complexity associated with distributed systems and
>> concurrency. The project grew out of the need to provide a flexible
>> platform for application developers and scientists that can be used
>> for quick experimentation and production.
>>
>> S4 differs from existing Apache projects in a number of fundamental
>> ways. Flume is an Incubator project that focuses on log processing,
>> performing lightweight processing in a distributed fashion and
>> accumulating log data in a centralized repository for batch
>> processing. S4 instead performs all stream processing in a distributed
>> fashion and enables applications to form arbitrary graphs to process
>> streams of events. We see Flume as a complementary project. We also
>> expect S4 to complement Hadoop processing and in some cases to
>> supersede it. Kafka is another Incubator project that focuses on
>> processing large amounts of stream data. The design of Kafka, however,
>> follows the pub-sub paradigm, which focuses on delivering messages
>> containing arbitrary data from source processes (publishers) to
>> consumer processes (subscribers). Compared to S4, Kafka is an
>> intermediate step between data generation and processing, while S4 is
>> itself a platform for processing streams of events.
>>
>> S4 overall addresses a need of existing applications to process
>> streams of events beyond moving data to a centralized repository for
>> batch processing. It complements the features of existing Apache
>> projects, such as Hadoop, Flume, and Kafka, by providing a flexible
>> platform for distributed event processing.
>>
>> == Background ==
>>
>> S4 was initially developed at Yahoo! Labs starting in 2008 to process
>> user feedback in the context of search advertising. The project was
>> licensed under the Apache License version 2.0 in October 2010. The
>> project documentation is currently available at http://s4.io .
>>
>> == Rationale ==
>>
>> Stream computing has been growing steadily over the last 20 years.
>> However, recently there has been an explosion in real-time data
>> sources including the Web, sensor networks, financial securities
>> analysis and trading, traffic monitoring, natural language processing
>> of news and social data, and much more.
>>
>> While Hadoop has evolved into a standard open source solution for batch
>> processing of massive data sets, there is no equivalent community
>> supported open source platform for processing data streams in
>> real-time. While various research projects have evolved into
>> proprietary commercial products, S4 has the potential to fill the gap.
>> Many projects that require a scalable stream processing architecture
>> currently use Hadoop by segmenting the input stream into data batches.
>> This solution is not efficient, results in high latency, and
>> introduces unnecessary complexity.
>>
>> The S4 design is primarily driven by large scale applications for data
>> mining and machine learning in a production environment. We think that
>> the S4 design is surprisingly flexible and lends itself to run in
>> large clusters built with commodity hardware.
>>
>> S4 enables application programmers to focus more on the application
>> and less on the infrastructure. S4 also provides a consistent graph
>> oriented programming model that, if widely adopted, will facilitate
>> sharing of basic components across developers.
>>
>> == Initial Goals ==
>>
>> The basic S4 infrastructure is complete and can be used in real-world
>> applications. However, many additional components need to be developed
>> and improved. Some areas we hope to focus on in Apache:
>>
>>   * Add a reliable communication protocol option to the communication
>> layer for low bandwidth control messages that require guaranteed
>> delivery.
>>   * Higher-performance serialization and inter-node communication.
>>   * Functionality to save the state of PEs at runtime transparently and
>> restore it at startup.
>>   * Intelligent load shedding strategies.
>>   * Dynamic load balancing to make it possible to add and remove nodes
>> from the cluster without data loss.
>>   * Dynamic application loading and unloading.
>>   * Migration to a pure object-oriented design that takes advantage of
>> Java static typing using Generics in the framework code. (Keep it
>> simple for the application developer.)
>>   * Eliminate string identifiers and XML configuration.
>>   * Adopt JSR 330 (Dependency Injection for Java).
>>   * Add real-time query support.
>>   * Add a cluster management system.
>>
>> Clearly this is a long list, but it sets the high-level roadmap for the project.
>>
>> == Current Status ==
>>
>> The project has been under development at Yahoo! since late 2008, and
>> it was open sourced in October 2010. Since then we have received
>> patches from developers, started a discussion forum, and improved the
>> documentation.
>>
>> === Meritocracy ===
>>
>> The S4 project was initially developed at Yahoo! Labs, a
>> research-oriented organization that values original ideas and
>> individual contributions. The design evolved in a bottom up fashion,
>> where decisions were driven by the application and the long-term
>> viability and flexibility of the platform. Once the project became
>> open-source it continued to be managed by those who were actively
>> doing the work.
>>
>> === Community ===
>>
>> S4 is currently in use internally at Yahoo!, and since it was released
>> as an open source project it has received positive feedback and
>> contributions from developers.
>>
>> === Core Developers ===
>>
>> S4 developers span a few companies and work on a voluntary basis. We
>> expect to have developers from other organizations joining the team in
>> the next few months, especially if S4 joins the Apache Incubator
>> project. Being an Apache Incubator project is likely to attract the
>> attention of more talented developers.
>>
>> One interesting aspect of the current group of developers is the
>> diverse background:
>>
>>   * Kishore Gopalakrishna was the main developer of the communication
>> layer and the integration with ZooKeeper. He has been an active
>> contributor to Hadoop.
>>   * Flavio Junqueira has a background in distributed computing. He is a
>> committer of ZooKeeper, a ZooKeeper PMC member, and a committer of
>> BookKeeper.
>>   * Matthieu Morel has an extensive background in distributed systems; he
>> likes theory and loves to implement things. He has been the main
>> designer and implementor of S4 checkpointing.
>>   * Anish Nair has been the project’s main customer. With his background
>> in natural language processing and algorithms, he developed the
>> applications that drove the S4 design, including processing of social
>> feeds and real-time recommendation engines.
>>   * Leo Neumeyer has a background in signal processing and statistical
>> modeling but has been advocating clean simple software design
>> throughout his career. At Yahoo! he conceived and championed the S4
>> project as a solution to improve monetization in search advertising.
>>   * Bruce Robbins has been the main S4 developer, taking the concept
>> from idea to releases. Bruce's engineering experience ranges from
>> programming Mainframe computers to assembly code.
>>
>> === Alignment ===
>>
>> S4 brings stream processing capabilities that complement Hadoop's
>> batch processing capabilities.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> S4 has been used in production at Yahoo! and is being evaluated by
>> other organizations. The developers have continued to support the
>> project on their own time. We believe that adoption will increase
>> significantly as more tools and documentation become available. As the
>> project evolves, we may see new ideas that we may want to adopt or, if
>> it makes sense and is practical, we may want to merge two or more open
>> source projects. We believe that there is a clear need to have a well
>> supported open source stream processing platform and therefore, there
>> is low risk of the project becoming orphaned. However, we are open to
>> combining projects in order to have fewer projects with a more active
>> community. Ultimately, this will be decided by the design ideas, the
>> implementation quality, and the adoption.
>>
>> === Inexperience with Open Source ===
>>
>> The S4 code was open sourced by Yahoo! under the Apache 2.0 license. One
>> committer of the S4 project, Flavio Junqueira, is intimately familiar
>> with the Apache model for open-source development and is experienced
>> with working with new contributors.  Flavio is both a committer and a PMC
>> member for ZooKeeper. The other developers have had experience as
>> contributors in other open-source projects. Most of the original S4
>> developers continue to be committers.
>>
>> === Homogeneous Developers ===
>>
>> The initial set of committers for S4 represents four different
>> companies: A9, Linkedin, Quantbench, and Yahoo!. This set is diverse
>> enough for a starting project.
>>
>> === Reliance on Salaried Developers ===
>>
>> Some committers are contributing as part of their jobs, but as we move
>> to a more diverse set of developers we expect a good mix of salaried
>> and volunteer time.
>>
>> === Relationships with Other Apache Projects ===
>>
>> S4 relies on the following Apache projects:
>>
>>   * BCEL (bytecode generation library)
>>   * commons cli (command line interface)
>>   * commons logging (needed by some other dependency)
>>   * log4j
>>   * commons jexl (expression processing)
>>   * zookeeper
>>   * Maven and its usual plug-ins (build time only)
>>
>> Compared to existing projects, S4 complements existing functionality
>> in a few ways summarized below:
>>   * Flume: S4 processes streams in a distributed fashion and enables
>> applications to form arbitrary graphs of processing elements. Flume
>> focuses on accumulating streams of logs in a centralized repository for
>> batch processing.
>>   * Kafka: Kafka is a pub/sub messaging layer that interposes
>> generation of events and processing, while S4 itself forwards events
>> and processes them in a stream fashion.
>>   * Hadoop: Hadoop focuses on batch processing of large data sets,
>> while S4 is a platform for stream processing of events. We would like
>> to implement extensions that enable processing in both platforms with
>> the same code.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> The project has already received a significant amount of attention and
>> so far has been associated with Yahoo!. We would like, however, to
>> foster the development of a community around S4 that evolves
>> independently of the interests of a single company. Given the reliance
>> of S4 on some Apache projects and the principles promoted by the
>> foundation, we find it a suitable home for the project.
>>
>> == Documentation ==
>>
>>   * S4 Website: http://s4.io
>>   * S4 documentation: http://docs.s4.io/
>>   * S4 Forum: http://groups.google.com/group/s4-project/topics
>>   * S4 Mailing list (with archives): http://groups.google.com/group/s4-project
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> The S4 source code is already licensed under the Apache License
>> version 2.0. The source code is available at https://github.com/s4
>>
>>
>> == External Dependencies ==
>>
>>   * asm (3-clause BSD license)
>>   * json (json.org's own license
>> http://www.crockford.com/JSON/license.html which is acceptable as per
>> Apache FAQ: http://www.apache.org/legal/resolved.html#json)
>>   * kryo (4-clause BSD license)
>>   * spring framework (Apache license - v 2)
>>   * codehaus jackson (Apache license)
>>   * junit (Common Public License - v 1.0)
>>
>> == Cryptography ==
>> None
>>
>> == Required Resources ==
>>
>> === Mailing lists ===
>>   * s4-dev
>>   * s4-user
>>   * s4-private (with moderated subscriptions)
>>   * s4-commit
>>
>> === Subversion Directory ===
>>
>> https://svn.apache.org/repos/asf/incubator/s4
>>
>> === Issue Tracking ===
>>
>> JIRA S4 (S4)
>>
>> == Initial Committers ==
>>   * Kishore Gopalakrishna (kg at s4 dot io)
>>   * Flavio Junqueira (fpj at s4 dot io)
>>   * Matthieu Morel (mm at s4 dot io)
>>   * Anish Nair (an at s4 dot com)
>>   * Leo Neumeyer (leo at s4 dot io)
>>   * Bruce Robbins (br at s4 dot io)
>>
>> == Affiliations ==
>>   * Kishore Gopalakrishna, Linkedin
>>   * Flavio Junqueira, Yahoo!
>>   * Matthieu Morel, Yahoo!
>>   * Anish Nair, A9
>>   * Leo Neumeyer, Quantbench
>>   * Bruce Robbins, Yahoo!
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>>   * Patrick Hunt
>>
>> === Nominated Mentors ===
>>
>>   * Patrick Hunt
>>   * Owen O’Malley
>>   * Arun Murthy
>>
>> === Sponsoring Entity ===
>>
>>   * Apache Incubator PMC
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>>
>
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


