incubator-general mailing list archives

From "Debo Dutta (dedutta)" <dedu...@cisco.com>
Subject Re: [PROPOSAL] Samza Proposal
Date Wed, 24 Jul 2013 02:12:32 GMT
Also add Storm to the mix. Storm also allows you to do back edges.

debo

On 7/23/13 6:48 PM, "Henry Saputra" <henry.saputra@gmail.com> wrote:

>Looks like this is similar to S4 (http://incubator.apache.org/s4/), which
>allows stream and real-time data processing via a DAG?
>
>
>- Henry
>
>
>On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco
><criccomini.aux@gmail.com> wrote:
>
>> Hey All,
>>
>> Sending along an incubator proposal for Samza.
>>
>> Thanks!
>> Chris
>>
>> https://wiki.apache.org/incubator/SamzaProposal
>>
>> --------------------------------------------
>>
>> == Abstract ==
>>
>> Samza is a stream processing system for running continuous computation on
>> infinite streams of data.
>>
>> == Proposal ==
>>
>> Samza provides a system for processing stream data from publish-subscribe
>> systems such as Apache Kafka. The developer writes a stream processing
>> task, and executes it as a Samza job. Samza then routes messages between
>> stream processing tasks and the publish-subscribe systems that the messages
>> are addressed to.
>>
>> == Background ==
>>
>> Samza was developed at LinkedIn to enable easier processing of streaming
>> data on top of Apache Kafka. Current use cases include content processing
>> pipelines, aggregating operational log data, data ingestion into
>> distributed database infrastructure, and measuring user activity across
>> different aggregation types.
>>
>> Samza is focused on providing an easy-to-use framework to process streams.
>> It uses Apache YARN to provide a mechanism for deploying stream processing
>> tasks in a distributed cluster. Samza also takes advantage of YARN to make
>> decisions about stream processor locality and co-partitioning of streams,
>> and to provide security. Apache Kafka is also leveraged to provide a
>> mechanism to pass messages from one stream processor to the next. Apache
>> Kafka is also used to help manage a stream processor's state, so that it
>> can be recovered in the event of a failure.
>>
>> Samza is written in Scala. It was developed internally at LinkedIn to meet
>> our particular use cases, but will be useful to many organizations facing a
>> similar need to reliably process large amounts of streaming data.
>> Therefore, we would like to share it with the ASF and begin developing a
>> community of developers and users within Apache.
>>
>> == Rationale ==
>>
>> Many organizations can benefit from a reliable stream processing system
>> such as Samza. While our use case of processing events from a large website
>> like LinkedIn has driven the design of Samza, its uses are varied and we
>> expect many new use cases to emerge. Samza provides a generic API to
>> process messages from streaming infrastructure and will appeal to many
>> users.
>>
>> == Current Status ==
>>
>> === Meritocracy ===
>>
>> Our intent with this incubator proposal is to start building a diverse
>> developer community around Samza following the Apache meritocracy model.
>> Since Samza was initially developed in late 2011, we have had fast adoption
>> and contributions by multiple teams at LinkedIn. We plan to continue
>> support for new contributors and work with those who contribute
>> significantly to the project to make them committers.
>>
>> === Community ===
>>
>> Samza is currently being used internally at LinkedIn. We hope to extend our
>> contributor base significantly and invite all those who are interested in
>> building large-scale distributed systems to participate.
>>
>> === Core Developers ===
>>
>> Samza is currently being developed by four engineers at LinkedIn: Jay
>> Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is an
>> ASF Member, Incubator PMC member and PMC member on Apache Hadoop, Kafka and
>> Giraph. Jay is a member of the Apache Kafka PMC and contributor to various
>> Apache projects. Chris has been an active contributor to several projects
>> including Apache Kafka and Apache YARN. Sriram has contributed to Samza, as
>> well as Apache Kafka.
>>
>> === Alignment ===
>>
>> The ASF is the natural choice to host the Samza project as its goal of
>> encouraging community-driven open-source projects fits with our vision for
>> Samza. Additionally, many other projects with which we are familiar and
>> with which we expect Samza to integrate, such as Apache ZooKeeper, YARN,
>> HDFS and log4j, are hosted by the ASF, and we will benefit and provide
>> benefit by close proximity to them.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> The core developers plan to work full time on the project. There is very
>> little risk of Samza being abandoned as it is part of LinkedIn's internal
>> infrastructure.
>>
>> === Inexperience with Open Source ===
>>
>> All of the core developers have experience with open source development.
>> Jay and Chris have been involved with several open source projects released
>> by LinkedIn, and Jay is a committer on Apache Kafka. Jakob has been
>> actively involved with the ASF as a full-time Hadoop committer and PMC
>> member. Sriram is a contributor to Apache Kafka.
>>
>> === Homogeneous Developers ===
>>
>> The current core developers are all from LinkedIn. However, we hope to
>> establish a developer community that includes contributors from several
>> corporations and we are actively encouraging new contributors via the
>> mailing lists and public presentations of Samza.
>>
>> === Reliance on Salaried Developers ===
>>
>> Currently, the developers are paid to work on Samza. Once the project has
>> a community built around it, we expect to gain committers, developers and
>> community members from outside the current core developers. However,
>> because LinkedIn relies on Samza internally, the reliance on salaried
>> developers is unlikely to change.
>>
>> === Relationships with Other Apache Products ===
>>
>> Samza is deeply integrated with Apache products. Samza uses Apache Kafka as
>> its underlying message passing system. Samza also uses Apache YARN for task
>> scheduling. Both YARN and Kafka, in turn, rely on Apache ZooKeeper for
>> coordination. In addition, we hope to integrate with Apache HDFS in the
>> near future.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> While we respect the reputation of the Apache brand and have no doubts that
>> it will attract contributors and users, our interest is primarily to give
>> Samza a solid home as an open source project following an established
>> development model. We have also given reasons in the Rationale and
>> Alignment sections.
>>
>> == Documentation ==
>>
>> http://wiki.apache.org/incubator/SamzaProposal
>>
>> == Initial Source ==
>>
>> Available upon request.
>>
>> == External Dependencies ==
>>
>> The dependencies all have Apache compatible licenses.
>>
>>  * metrics (Apache 2.0)
>>  * zkclient (Apache 2.0)
>>  * zookeeper (Apache 2.0)
>>  * jetty (Apache 2.0)
>>  * jackson (Apache 2.0)
>>  * commons-httpclient (Apache 2.0)
>>  * slf4j (MIT)
>>  * avro (Apache 2.0)
>>  * hadoop (Apache 2.0)
>>  * junit (Common Public License)
>>  * grizzled-slf4j (BSD)
>>  * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
>>  * scala (http://www.scala-lang.org/node/146)
>>  * joptsimple (MIT)
>>  * kafka (Apache 2.0)
>>  * scalate (Apache 2.0)
>>  * leveldb jni (BSD)
>>
>> == Cryptography ==
>>
>> Samza will depend on secure Hadoop, which can optionally use Kerberos.
>>
>> == Required Resources ==
>>
>> === Mailing Lists ===
>>
>> samza-private for private PMC discussions (with moderated subscriptions)
>> samza-dev
>> samza-commits
>> samza-user
>>
>> === Subversion Directory ===
>>
>> Git is the preferred source control system: git://git.apache.org/samza
>>
>> === Issue Tracking ===
>>
>> JIRA Samza (SAMZA)
>>
>> === Other Resources ===
>>
>> The existing code already has unit tests, so we would like a Hudson
>> instance to run them whenever a new patch is submitted. This can be added
>> after project creation.
>>
>> == Initial Committers ==
>>
>>  * Jay Kreps
>>  * Jakob Homan
>>  * Chris Riccomini
>>  * Sriram Subramanian
>>
>> == Affiliations ==
>>
>>  * Jay Kreps (LinkedIn)
>>  * Jakob Homan (LinkedIn)
>>  * Chris Riccomini (LinkedIn)
>>  * Sriram Subramanian (LinkedIn)
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>> Jakob Homan (Apache Member)
>>
>> === Nominated Mentors ===
>>
>>  * Arun C Murthy <acmurthy at apache dot org>
>>  * Chris Douglas <cdouglas at apache dot org>
>>  * Roman Shaposhnik <rvs at apache dot org>
>>
>> === Sponsoring Entity ===
>>
>> We are requesting the Incubator to sponsor this project.
>>



