Date: Mon, 26 Sep 2011 09:47:24 -0700
Subject: Re: [VOTE] S4 to join the Incubator
From: Patrick Hunt
To: general@incubator.apache.org

This passes, with 16 +1 votes, plenty of them binding, and no -1 votes.
Thanks to all who voted! We can now get started creating the Apache S4
podling.
Patrick

On Tue, Sep 20, 2011 at 1:56 PM, Patrick Hunt wrote:
> It's been nearly a week since the S4 proposal was submitted for
> discussion. A few questions were asked, and the proposal was clarified
> in response. Sufficient mentors have volunteered. I thus feel we are
> now ready for a vote.
>
> The latest proposal can be found at the end of this email and at:
>
>  http://wiki.apache.org/incubator/S4Proposal
>
> The discussion regarding the proposal can be found at:
>
>  http://s.apache.org/RMU
>
> Please cast your votes:
>
> [  ] +1 Accept S4 for incubation
> [  ] +0 Indifferent to S4 incubation
> [  ] -1 Reject S4 for incubation
>
> This vote will close 72 hours from now.
>
> Thanks,
>
> Patrick
>
> ------------------
> = S4 Proposal =
>
> == Abstract ==
>
> S4 (Simple Scalable Streaming System) is a general-purpose,
> distributed, scalable, partially fault-tolerant, pluggable platform
> that allows programmers to easily develop applications for processing
> continuous, unbounded streams of data.
>
> == Proposal ==
>
> S4 is a software platform written in Java. Clients that send and
> receive events can be written in any programming language. S4 also
> includes a collection of modules called Processing Elements (or PEs
> for short) that implement basic functionality and can be used by
> application developers. In S4, keyed data events are routed with
> affinity to Processing Elements (PEs), which consume the events and do
> one or both of the following: (1) ''emit'' one or more events which
> may be consumed by other PEs, (2) ''publish'' results. The
> architecture resembles the Actors model, providing semantics of
> encapsulation and location transparency, thus allowing applications to
> be massively concurrent while exposing a simple programming interface
> to application developers.
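
The keyed routing described in the Proposal paragraph above can be
illustrated with a minimal, single-process Java sketch. This is not S4's
API: the names Event, SumPE, and Router are hypothetical and exist only
for this illustration, and the sketch assumes one PE instance per key
held in a local map, whereas S4 spreads PE instances across cluster
nodes.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyedRoutingSketch {

    /** A keyed event: the key decides which PE instance receives it. */
    static final class Event {
        final String key;
        final int value;

        Event(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical PE that keeps a running sum for a single key. */
    static final class SumPE {
        private final String key;
        private int total;

        SumPE(String key) {
            this.key = key;
        }

        /** Consumes one event and emits a new event carrying the updated sum. */
        Event process(Event e) {
            total += e.value;
            return new Event(key, total);
        }
    }

    /** Routes with affinity: the same key always reaches the same PE instance. */
    static final class Router {
        private final Map<String, SumPE> instances = new HashMap<>();

        Event route(Event e) {
            // Create the PE for this key on first use, then reuse it.
            SumPE pe = instances.computeIfAbsent(e.key, SumPE::new);
            return pe.process(e);
        }

        int instanceCount() {
            return instances.size();
        }
    }

    public static void main(String[] args) {
        Router router = new Router();
        router.route(new Event("ad-1", 1));
        router.route(new Event("ad-2", 5));
        Event emitted = router.route(new Event("ad-1", 2));
        // "Publishing" a result here is just printing it.
        System.out.println(emitted.key + " -> " + emitted.value);
    }
}
```

Because both "ad-1" events reach the same SumPE instance, the state for
that key lives in one place and needs no locking in application code,
which is the property the Actors comparison refers to.
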
>
> To drive adoption and increase the number of contributors to the
> project, we may need to prioritize the focus based on feedback from
> the community. We believe that one of the top priorities and a driving
> design principle for the S4 project is to provide a simple API that
> hides most of the complexity associated with distributed systems and
> concurrency. The project grew out of the need to provide a flexible
> platform for application developers and scientists that can be used
> for quick experimentation and production.
>
> S4 differs from existing Apache projects in a number of fundamental
> ways. Flume is an Incubator project that focuses on log processing,
> performing lightweight processing in a distributed fashion and
> accumulating log data in a centralized repository for batch
> processing. S4 instead performs all stream processing in a distributed
> fashion and enables applications to form arbitrary graphs to process
> streams of events. We see Flume as a complementary project. We also
> expect S4 to complement Hadoop processing and in some cases to
> supersede it. Kafka is another Incubator project that focuses on
> processing large amounts of stream data. The design of Kafka, however,
> follows the pub-sub paradigm, which focuses on delivering messages
> containing arbitrary data from source processes (publishers) to
> consumer processes (subscribers). Compared to S4, Kafka is an
> intermediate step between data generation and processing, while S4 is
> itself a platform for processing streams of events.
>
> S4 overall addresses a need of existing applications to process
> streams of events beyond moving data to a centralized repository for
> batch processing. It complements the features of existing Apache
> projects, such as Hadoop, Flume, and Kafka, by providing a flexible
> platform for distributed event processing.
>
> == Background ==
>
> S4 was initially developed at Yahoo! Labs starting in 2008 to process
> user feedback in the context of search advertising. The project was
> licensed under the Apache License version 2.0 in October 2010. The
> project documentation is currently available at http://s4.io .
>
> == Rationale ==
>
> Stream computing has been growing steadily over the last 20 years.
> However, recently there has been an explosion in real-time data
> sources including the Web, sensor networks, financial securities
> analysis and trading, traffic monitoring, natural language processing
> of news and social data, and much more.
>
> While Hadoop has evolved into a standard open source solution for
> batch processing of massive data sets, there is no equivalent
> community-supported open source platform for processing data streams
> in real-time. While various research projects have evolved into
> proprietary commercial products, S4 has the potential to fill the gap.
> Many projects that require a scalable stream processing architecture
> currently use Hadoop by segmenting the input stream into data batches.
> This solution is not efficient, results in high latency, and
> introduces unnecessary complexity.
>
> The S4 design is primarily driven by large scale applications for data
> mining and machine learning in a production environment. We think that
> the S4 design is surprisingly flexible and lends itself to run in
> large clusters built with commodity hardware.
>
> S4 enables application programmers to focus more on the application
> and less on the infrastructure. S4 also provides a consistent graph
> oriented programming model that, if widely adopted, will facilitate
> sharing of basic components across developers.
>
> == Initial Goals ==
>
> The basic S4 infrastructure is complete and can be used in real-world
> applications. However, many additional components need to be developed
> and improved. Some areas we hope to focus on in Apache:
>
>  * Add a reliable communication protocol option to the communication
> layer for low bandwidth control messages that require guaranteed
> delivery.
>  * Higher-performance serialization and inter-node communication.
>  * Functionality to save the state of PEs at runtime transparently and
> restore it at startup.
>  * Intelligent load shedding strategies.
>  * Dynamic load balancing to make it possible to add and remove nodes
> from the cluster without data loss.
>  * Dynamic application loading and unloading.
>  * Migration to a pure object-oriented design that takes advantage of
> Java static typing using Generics in the framework code. (Keep it
> simple for the application developer.)
>  * Eliminate string identifiers and XML configuration.
>  * Adopt JSR 330 (Dependency Injection for Java).
>  * Add real-time query support.
>  * Add a cluster management system.
>
> Clearly this is a long list, but it sets the high-level roadmap for
> the project.
>
> == Current Status ==
>
> The project has been under development at Yahoo! since late 2008, and
> it was open sourced in October 2010. Since then we have received
> patches from developers, started a discussion forum, and improved the
> documentation.
>
> === Meritocracy ===
>
> The S4 project was initially developed at Yahoo! Labs, a
> research-oriented organization that values original ideas and
> individual contributions. The design evolved in a bottom-up fashion,
> where decisions were driven by the application and the long-term
> viability and flexibility of the platform. Once the project became
> open-source it continued to be managed by those who were actively
> doing the work.
>
> === Community ===
>
> S4 is currently in use internally at Yahoo!, and since it was released
> as an open source project it has received positive feedback and
> contributions from developers.
>
> === Core Developers ===
>
> S4 developers span a few companies and work on a voluntary basis. We
> expect to have developers from other organizations joining the team in
> the next few months, especially if S4 joins the Apache Incubator
> project. Being an Apache Incubator project is likely to attract the
> attention of more talented developers.
>
> One interesting aspect of the current group of developers is the
> diverse background:
>
>  * Kishore Gopalakrishna was the main developer of the communication
> layer and the integration with ZooKeeper. He has been an active
> contributor to Hadoop.
>  * Flavio Junqueira has a background in distributed computing. He is a
> committer of ZooKeeper, a ZooKeeper PMC member, and a committer of
> BookKeeper.
>  * Matthieu Morel has extensive background in distributed systems; he
> likes theory and loves to implement things. He has been the main
> designer and implementor of S4 checkpointing.
>  * Anish Nair has been the project's main customer. With his
> background in natural language processing and algorithms he developed
> the applications that drove the S4 design including processing of
> social feeds and real-time recommendation engines.
>  * Leo Neumeyer has a background in signal processing and statistical
> modeling but has been advocating clean simple software design
> throughout his career. At Yahoo! he conceived and championed the S4
> project as a solution to improve monetization in search advertising.
>  * Bruce Robbins has been the main S4 developer, taking the concept
> from idea to releases. Bruce's engineering experience ranges from
> programming Mainframe computers to assembly code.
>
> === Alignment ===
>
> S4 brings stream processing capabilities that complement Hadoop's
> batch processing capabilities.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> S4 has been used in production at Yahoo! and is being evaluated by
> other organizations. The developers have continued to support the
> project on their own time. We believe that adoption will increase
> significantly as more tools and documentation become available. As the
> project evolves, we may see new ideas that we may want to adopt or, if
> it makes sense and is practical, we may want to merge two or more open
> source projects. We believe that there is a clear need to have a well
> supported open source stream processing platform and therefore, there
> is low risk of the project becoming orphaned. However, we are open to
> combining projects in order to have fewer projects with a more active
> community. Ultimately, this will be decided by the design ideas, the
> implementation quality, and the adoption.
>
> === Inexperience with Open Source ===
>
> The S4 code was open sourced by Yahoo! under Apache 2.0 license. One
> committer of the S4 project, Flavio Junqueira, is intimately familiar
> with the Apache model for open-source development and is experienced
> with working with new contributors. Flavio is both a committer and a
> PMC member for ZooKeeper. The other developers have had experience as
> contributors in other open-source projects. Most of the original S4
> developers continue to be committers.
>
> === Homogeneous Developers ===
>
> The initial set of committers for S4 represents four different
> companies: A9, Linkedin, Quantbench, and Yahoo!. This set is diverse
> enough for a starting project.
>
> === Reliance on Salaried Developers ===
>
> Some committers are contributing as part of their jobs, but as we move
> to a more diverse set of developers we expect a good mix of salaried
> and volunteer time.
>
> === Relationships with Other Apache Projects ===
>
> S4 relies on the following Apache projects:
>
>  * BCEL (bytecode generation library)
>  * commons cli (command line interface)
>  * commons logging (needed by some other dependency)
>  * log4j
>  * commons jexl (expression processing)
>  * zookeeper
>  * Maven and its usual plug-ins (build time only)
>
> Compared to existing projects, S4 complements existing functionality
> in a few ways summarized below:
>  * Flume: S4 processes streams in a distributed fashion and enables
> applications to form arbitrary graphs of processing elements. Flume
> focuses on accumulating streams of logs in a centralized repository for
> batch processing.
>  * Kafka: Kafka is a pub/sub messaging layer that sits between the
> generation of events and their processing, while S4 itself forwards
> events and processes them in a stream fashion.
>  * Hadoop: Hadoop focuses on batch processing of large data sets,
> while S4 is a platform for stream processing of events. We would like
> to implement extensions that enable processing in both platforms with
> the same code.
>
> === An Excessive Fascination with the Apache Brand ===
>
> The project has already received a significant amount of attention and
> so far has been associated with Yahoo!. We would like, however, to
> foster the development of a community around S4 that evolves
> independently of the interests of a single company. Given the reliance
> of S4 on some Apache projects and the principles promoted by the
> foundation, we find it a suitable home for the project.
>
> == Documentation ==
>
>  * S4 Website: http://s4.io
>  * S4 documentation: http://docs.s4.io/
>  * S4 Forum: http://groups.google.com/group/s4-project/topics
>  * S4 Mailing list (with archives): http://groups.google.com/group/s4-project
>
> == Source and Intellectual Property Submission Plan ==
>
> The S4 source code is already licensed under Apache Software License
> 2.0. The source code is available at https://github.com/s4
>
>
> == External Dependencies ==
>
>  * asm (3-clause BSD license)
>  * json (json.org's own license
> http://www.crockford.com/JSON/license.html which is acceptable as per
> Apache FAQ: http://www.apache.org/legal/resolved.html#json)
>  * kryo (4-clause BSD license)
>  * spring framework (Apache license - v 2)
>  * codehaus jackson (Apache license)
>  * junit (Common Public License - v 1.0)
>
> == Cryptography ==
> None
>
> == Required Resources ==
>
> === Mailing lists ===
>  * s4-dev
>  * s4-user
>  * s4-private (with moderated subscriptions)
>  * s4-commit
>
> === Subversion Directory ===
>
> https://svn.apache.org/repos/asf/incubator/s4
>
> === Issue Tracking ===
>
> JIRA S4 (S4)
>
> == Initial Committers ==
>  * Kishore Gopalakrishna (kg at s4 dot io)
>  * Flavio Junqueira (fpj at s4 dot io)
>  * Matthieu Morel (mm at s4 dot io)
>  * Anish Nair (an at s4 dot com)
>  * Leo Neumeyer (leo at s4 dot io)
>  * Bruce Robbins (br at s4 dot io)
>
> == Affiliations ==
>  * Kishore Gopalakrishna, Linkedin
>  * Flavio Junqueira, Yahoo!
>  * Matthieu Morel, Yahoo!
>  * Anish Nair, A9
>  * Leo Neumeyer, Quantbench
>  * Bruce Robbins, Yahoo!
>
> == Sponsors ==
>
> === Champion ===
>
>  * Patrick Hunt
>
> === Nominated Mentors ===
>
>  * Patrick Hunt
>  * Owen O'Malley
>  * Arun Murthy
>
> === Sponsoring Entity ===
>
>  * Apache Incubator PMC
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org