From: Chris Riccomini
To: general@incubator.apache.org
Reply-To: general@incubator.apache.org
Date: Wed, 31 Jul 2013 14:59:56 -0700
Subject: Re: [PROPOSAL] Samza Proposal
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7b6d9de094ab2d04e2d5d8fb

--047d7b6d9de094ab2d04e2d5d8fb
Content-Type: text/plain; charset=ISO-8859-1

Hey Guys,

Jakob (the project Champion) is in the process of getting all of the
resources requested in our proposal (JIRA, Hudson, webspace, etc). As soon
as we have webspace allocated, we'll put the Samza site up, which has all
of these docs on it. Henry, as you said, I'll follow up with this thread
when they're up.

Cheers,
Chris


On Wed, Jul 31, 2013 at 2:04 PM, Henry Saputra wrote:

> Well, usually a VOTE is conducted after discussion has calmed down. Looks
> like this time the VOTE started even though there were some questions about
> the proposal.
>
> It would be great to actually add links to the comparisons in the thread,
> even though the VOTE has concluded.
>
> - Henry
>
>
> On Wed, Jul 31, 2013 at 1:29 PM, Phillip Rhodes wrote:
>
> > Same here. Not that it matters as far as admission to the incubator
> > (that vote is over now anyway), but I think a lot of people (including
> > potential users of Samza) would like to see more about how it compares
> > & contrasts with other stream oriented systems.
> >
> >
> > Phil
> > This message optimized for indexing by NSA PRISM
> >
> >
> > On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu wrote:
> > > +1
> > >
> > > I would love to see the "documents comparing and contrasting Samza with
> > > MUPD8 and Storm."
> > >
> > >
> > > On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:
> > >
> > >> +1 on incubation.
> > >>
> > >> Enis
> > >>
> > >>
> > >> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini wrote:
> > >>
> > >> > Hey Henry and Debo,
> > >> >
> > >> > Thanks for calling this out. Samza's feature set includes:
> > >> >
> > >> >    - *Simple API:* Unlike most low-level messaging system APIs, Samza
> > >> >    provides a very simple callback-based "process message" API that
> > >> >    should be familiar to anyone that's used Map/Reduce.
> > >> >    - *Managed state:* Samza manages snapshotting and restoration of a
> > >> >    stream processor's state. Samza will restore a stream processor's
> > >> >    state to a snapshot consistent with the processor's last read
> > >> >    messages when the processor is restarted.
> > >> >    - *Fault tolerance:* Samza will work with YARN to restart your stream
> > >> >    processor if there is a machine or processor failure.
> > >> >    - *Durability:* Samza uses Kafka to guarantee that no messages will
> > >> >    ever be lost.
> > >> >    - *Scalability:* Samza is partitioned and distributed at every level.
> > >> >    Kafka provides ordered, partitioned, replayable, fault-tolerant
> > >> >    streams. YARN provides a distributed environment for Samza containers
> > >> >    to run in.
> > >> >    - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> > >> >    Samza provides a pluggable API that lets you run Samza with other
> > >> >    messaging systems and execution environments.
> > >> >    - *Processor isolation:* Samza works with Apache YARN, which supports
> > >> >    processor security through Hadoop's security model, and resource
> > >> >    isolation through Linux CGroups.
> > >> >
> > >> > Some of these features are available in S4, and some are not. The same
> > >> > holds true for Storm.
> > >> >
> > >> > The open source stream processing systems that are available are
> > >> > actually quite young, and no single system offers a complete solution.
> > >> > Problems like how a stream processor's state (e.g. counts) should be
> > >> > managed, whether a stream should be buffered remotely on disk or not,
> > >> > what to do when duplicate messages are received or messages are lost,
> > >> > and how to model underlying messaging systems are all pretty new.
> > >> >
> > >> > Samza's main differentiators are:
> > >> >
> > >> >    - State is modeled as a stream. When a processor fails and is
> > >> >    restarted, the state stream is entirely replayed to restore it.
> > >> >    - Streams are ordered, partitioned, replayable, and fault tolerant.
> > >> >    - YARN is used for processor isolation, security, and fault tolerance.
> > >> >    - All streams are materialized to Kafka.
> > >> >
> > >> > If you guys are interested, I have much more in-depth documents
> > >> > comparing and contrasting Samza with MUPD8 and Storm.
> > >> >
> > >> > Cheers,
> > >> > Chris
> > >> >
> > >> >
> > >> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.saputra@gmail.com> wrote:
> > >> >
> > >> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/),
> > >> > > which allows stream and real time data processing via DAG?
> > >> > >
> > >> > >
> > >> > > - Henry
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini.aux@gmail.com> wrote:
> > >> > >
> > >> > > > Hey All,
> > >> > > >
> > >> > > > Sending along an incubator proposal for Samza.
> > >> > > >
> > >> > > > Thanks!
> > >> > > > Chris
> > >> > > >
> > >> > > > https://wiki.apache.org/incubator/SamzaProposal
> > >> > > >
> > >> > > > --------------------------------------------
> > >> > > >
> > >> > > > == Abstract ==
> > >> > > >
> > >> > > > Samza is a stream processing system for running continuous
> > >> > > > computation on infinite streams of data.
> > >> > > >
> > >> > > > == Proposal ==
> > >> > > >
> > >> > > > Samza provides a system for processing stream data from
> > >> > > > publish-subscribe systems such as Apache Kafka. The developer writes
> > >> > > > a stream processing task, and executes it as a Samza job. Samza then
> > >> > > > routes messages between stream processing tasks and the
> > >> > > > publish-subscribe systems that the messages are addressed to.
> > >> > > >
> > >> > > > == Background ==
> > >> > > >
> > >> > > > Samza was developed at LinkedIn to enable easier processing of
> > >> > > > streaming data on top of Apache Kafka. Current use cases include
> > >> > > > content processing pipelines, aggregating operational log data, data
> > >> > > > ingestion into distributed database infrastructure, and measuring
> > >> > > > user activity across different aggregation types.
> > >> > > >
> > >> > > > Samza is focused on providing an easy to use framework to process
> > >> > > > streams. It uses Apache YARN to provide a mechanism for deploying
> > >> > > > stream processing tasks in a distributed cluster. Samza also takes
> > >> > > > advantage of YARN to make decisions about stream processor locality
> > >> > > > and co-partitioning of streams, and to provide security. Apache
> > >> > > > Kafka is also leveraged to provide a mechanism to pass messages from
> > >> > > > one stream processor to the next.
> > >> > > > Apache Kafka is also used to help manage a stream processor's state,
> > >> > > > so that it can be recovered in the event of a failure.
> > >> > > >
> > >> > > > Samza is written in Scala. It was developed internally at LinkedIn
> > >> > > > to meet our particular use cases, but will be useful to many
> > >> > > > organizations facing a similar need to reliably process large
> > >> > > > amounts of streaming data. Therefore, we would like to share it with
> > >> > > > the ASF and begin developing a community of developers and users
> > >> > > > within Apache.
> > >> > > >
> > >> > > > == Rationale ==
> > >> > > >
> > >> > > > Many organizations can benefit from a reliable stream processing
> > >> > > > system such as Samza. While our use case of processing events from a
> > >> > > > large website like LinkedIn has driven the design of Samza, its uses
> > >> > > > are varied and we expect many new use cases to emerge. Samza
> > >> > > > provides a generic API to process messages from streaming
> > >> > > > infrastructure and will appeal to many users.
> > >> > > >
> > >> > > > == Current Status ==
> > >> > > >
> > >> > > > === Meritocracy ===
> > >> > > >
> > >> > > > Our intent with this incubator proposal is to start building a
> > >> > > > diverse developer community around Samza following the Apache
> > >> > > > meritocracy model. Since Samza was initially developed in late 2011,
> > >> > > > we have had fast adoption and contributions by multiple teams at
> > >> > > > LinkedIn. We plan to continue support for new contributors and work
> > >> > > > with those who contribute significantly to the project to make them
> > >> > > > committers.
> > >> > > >
> > >> > > > === Community ===
> > >> > > >
> > >> > > > Samza is currently being used internally at LinkedIn.
> > >> > > > We hope to extend our contributor base significantly and invite all
> > >> > > > those who are interested in building large-scale distributed systems
> > >> > > > to participate.
> > >> > > >
> > >> > > > === Core Developers ===
> > >> > > >
> > >> > > > Samza is currently being developed by four engineers at LinkedIn:
> > >> > > > Jay Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini.
> > >> > > > Jakob is an ASF Member, Incubator PMC member and PMC member on
> > >> > > > Apache Hadoop, Kafka and Giraph. Jay is a member of the Apache Kafka
> > >> > > > PMC and contributor to various Apache projects. Chris has been an
> > >> > > > active contributor for several projects including Apache Kafka and
> > >> > > > Apache YARN. Sriram has contributed to Samza, as well as Apache
> > >> > > > Kafka.
> > >> > > >
> > >> > > > === Alignment ===
> > >> > > >
> > >> > > > The ASF is the natural choice to host the Samza project as its goal
> > >> > > > of encouraging community-driven open-source projects fits with our
> > >> > > > vision for Samza. Additionally, many other projects that we are
> > >> > > > familiar with and expect Samza to integrate with, such as Apache
> > >> > > > ZooKeeper, YARN, HDFS and log4j, are hosted by the ASF, and we will
> > >> > > > benefit and provide benefit by close proximity to them.
> > >> > > >
> > >> > > > == Known Risks ==
> > >> > > >
> > >> > > > === Orphaned Products ===
> > >> > > >
> > >> > > > The core developers plan to work full time on the project. There is
> > >> > > > very little risk of Samza being abandoned as it is part of
> > >> > > > LinkedIn's internal infrastructure.
> > >> > > >
> > >> > > > === Inexperience with Open Source ===
> > >> > > >
> > >> > > > All of the core developers have experience with open source
> > >> > > > development. Jay and Chris have been involved with several open
> > >> > > > source projects released by LinkedIn, and Jay is a committer on
> > >> > > > Apache Kafka. Jakob has been actively involved with the ASF as a
> > >> > > > full-time Hadoop committer and PMC member. Sriram is a contributor
> > >> > > > to Apache Kafka.
> > >> > > >
> > >> > > > === Homogeneous Developers ===
> > >> > > >
> > >> > > > The current core developers are all from LinkedIn. However, we hope
> > >> > > > to establish a developer community that includes contributors from
> > >> > > > several corporations and we are actively encouraging new
> > >> > > > contributors via the mailing lists and public presentations of
> > >> > > > Samza.
> > >> > > >
> > >> > > > === Reliance on Salaried Developers ===
> > >> > > >
> > >> > > > Currently, the developers are paid to do work on Samza. However,
> > >> > > > once the project has a community built around it, we expect to get
> > >> > > > committers, developers and community from outside the current core
> > >> > > > developers. However, because LinkedIn relies on Samza internally,
> > >> > > > the reliance on salaried developers is unlikely to change.
> > >> > > >
> > >> > > > === Relationships with Other Apache Products ===
> > >> > > >
> > >> > > > Samza is deeply integrated with Apache products. Samza uses Apache
> > >> > > > Kafka as its underlying message passing system. Samza also uses
> > >> > > > Apache YARN for task scheduling. Both YARN and Kafka, in turn, rely
> > >> > > > on Apache ZooKeeper for coordination.
> > >> > > > In addition, we hope to integrate with Apache HDFS in the near
> > >> > > > future.
> > >> > > >
> > >> > > > === An Excessive Fascination with the Apache Brand ===
> > >> > > >
> > >> > > > While we respect the reputation of the Apache brand and have no
> > >> > > > doubts that it will attract contributors and users, our interest is
> > >> > > > primarily to give Samza a solid home as an open source project
> > >> > > > following an established development model. We have also given
> > >> > > > reasons in the Rationale and Alignment sections.
> > >> > > >
> > >> > > > == Documentation ==
> > >> > > >
> > >> > > > http://wiki.apache.org/incubator/SamzaProposal
> > >> > > >
> > >> > > > == Initial Source ==
> > >> > > >
> > >> > > > Available upon request.
> > >> > > >
> > >> > > > == External Dependencies ==
> > >> > > >
> > >> > > > The dependencies all have Apache compatible licenses.
> > >> > > >
> > >> > > >  * metrics (Apache 2.0)
> > >> > > >  * zkclient (Apache 2.0)
> > >> > > >  * zookeeper (Apache 2.0)
> > >> > > >  * jetty (Apache 2.0)
> > >> > > >  * jackson (Apache 2.0)
> > >> > > >  * commons-httpclient (Apache 2.0)
> > >> > > >  * slf4j (MIT)
> > >> > > >  * avro (Apache 2.0)
> > >> > > >  * hadoop (Apache 2.0)
> > >> > > >  * junit (Common Public License)
> > >> > > >  * grizzled-slf4j (BSD)
> > >> > > >  * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
> > >> > > >  * scala (http://www.scala-lang.org/node/146)
> > >> > > >  * joptsimple (MIT)
> > >> > > >  * kafka (Apache 2.0)
> > >> > > >  * scalate (Apache 2.0)
> > >> > > >  * leveldb jni (BSD)
> > >> > > >
> > >> > > > == Cryptography ==
> > >> > > >
> > >> > > > Samza will depend on secure Hadoop, which can optionally use
> > >> > > > Kerberos.
> > >> > > >
> > >> > > > == Required Resources ==
> > >> > > >
> > >> > > > === Mailing Lists ===
> > >> > > >
> > >> > > >  samza-private for private PMC discussions (with moderated subscriptions)
> > >> > > >  samza-dev
> > >> > > >  samza-commits
> > >> > > >  samza-user
> > >> > > >
> > >> > > > === Subversion Directory ===
> > >> > > >
> > >> > > > Git is the preferred source control system: git://git.apache.org/samza
> > >> > > >
> > >> > > > === Issue Tracking ===
> > >> > > >
> > >> > > > JIRA Samza (SAMZA)
> > >> > > >
> > >> > > > === Other Resources ===
> > >> > > >
> > >> > > > The existing code already has unit tests, so we would like a Hudson
> > >> > > > instance to run them whenever a new patch is submitted. This can be
> > >> > > > added after project creation.
> > >> > > >
> > >> > > > == Initial Committers ==
> > >> > > >
> > >> > > >  * Jay Kreps
> > >> > > >  * Jakob Homan
> > >> > > >  * Chris Riccomini
> > >> > > >  * Sriram Subramanian
> > >> > > >
> > >> > > > == Affiliations ==
> > >> > > >
> > >> > > >  * Jay Kreps (LinkedIn)
> > >> > > >  * Jakob Homan (LinkedIn)
> > >> > > >  * Chris Riccomini (LinkedIn)
> > >> > > >  * Sriram Subramanian (LinkedIn)
> > >> > > >
> > >> > > > == Sponsors ==
> > >> > > >
> > >> > > > === Champion ===
> > >> > > >
> > >> > > > Jakob Homan (Apache Member)
> > >> > > >
> > >> > > > === Nominated Mentors ===
> > >> > > >
> > >> > > >  * Arun C Murthy
> > >> > > >  * Chris Douglas
> > >> > > >  * Roman Shaposhnik
> > >> > > >
> > >> > > > === Sponsoring Entity ===
> > >> > > >
> > >> > > > We are requesting the Incubator to sponsor this project.
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> > > --
> > > Best Regards,
> > > -- Alex
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > For additional commands, e-mail: general-help@incubator.apache.org
> >
> >
>

--047d7b6d9de094ab2d04e2d5d8fb--