MIME-Version: 1.0
Date: Mon, 29 Jun 2015 21:26:08 +0200
Subject: Re: [PROPOSAL]Pistachio
From: jan i
To: "general@incubator.apache.org"
Content-Type: multipart/alternative; boundary=001a1136cbd8cb5e800519ad0f23

--001a1136cbd8cb5e800519ad0f23
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi

I can for sure follow the argument that different design ideas around a problem complex leads to different implementations.
My concern is a little bit different. I assume that the developers are in general more interested in the problem complex than the design. If I am correct such projects will be competing for the same developer, and might find it hard to grow.

I respect "internal competition" it can be very fruitful, we just need to make sure that we don´t split a good community into smaller communities that are too small to survive.

just my little concern after having read the last couple of emails.

rgds
jan i.

On 29 June 2015 at 20:53, Gavin Li wrote:

> Hi Andrew,
>
> I agree with you. I've updated the proposal to include a little bit more explanations about the difference with Hadoop.
>
> Purely pursuing novelty is never our interest. Instead I believe even for the same problem different design and implementation ideas can make big difference. I think that's why there are many "internal competitions" in ASF. Having looked at other systems like Ignite and Geode I believe Pistachio is still quite different in design and implementation when solving some common problems like in-memory distributed storage and co-locating computation and data.
>
> Thanks,
> Gavin Li
>
> On Fri, Jun 26, 2015 at 12:07 PM, Andrew Purtell wrote:
>
> > Thanks Gavin.
> >
> > Please let me suggest that novelty is not a requirement for incubation, and a proposal doesn't need to make claims of novelty to be accepted.
> >
> > Should the proposal be accepted for incubation, you may find your new neighbors at Apache can do X where you weren't aware of it. It will be totally up to the new podling if you want to survey the landscape when figuring out how to differentiate, but I do recommend it, it may help you crystallize a community around a real difference and advantage provided by Pistachio.
> >
> > On Mon, Jun 22, 2015 at 7:54 PM, Gavin Li wrote:
> >
> > > Hi Andrew,
> > >
> > > As we described more in
> > > http://yahooeng.tumblr.com/post/116291838351/pistachio-co-locate-the-data-and-compute-for ,
> > > a very common problem we saw in Hadoop use cases is we often need to persist the previous result of one map reduce job onto HDFS, then the next day we process the new data together with the previous result. Usually the most expensive part is the shuffling part where we need to join the previous data and the new data together. It's so expensive because HDFS doesn't store the data in a partitioned way. So data have to be transferred again and again in the shuffling phase. Instead, in Pistachio we do the computation right on top of the partitioned storage layer, so that the previous result is always stored in a partitioned way, so shuffling can be avoided. Expensive IO and roundtrips can thus be avoided so that much better performance can be achieved.
> > >
> > > The other difference is in Pistachio we can do computation based on in-memory storage with data replication. Different from the in-memory computation in Spark, the storage can be in-memory here.
> > >
> > > Please let me know if I'm not clear enough.
> > >
> > > Thanks,
> > > Gavin Li
> > >
> > > On Mon, Jun 22, 2015 at 7:53 PM, Andrew Purtell wrote:
> > >
> > > > It was a simple question, and not meant to suggest anything one way or other regarding my opinion of this proposal.
> > > >
> > > > On Monday, June 22, 2015, John D.
> > > > Ament wrote:
> > > >
> > > > > On Mon, Jun 22, 2015 at 10:26 PM Andrew Purtell <apurtell@apache.org> wrote:
> > > > >
> > > > > > > Pistachio can easily embed computation to the storage layer to achieve the best data locality to improve the computation performance significantly which is an innovative model comparing with the normal ways where the storage and compute are independent to each other.
> > > > > >
> > > > > > Have you heard of something called Hadoop?
> > > > > >
> > > > >
> > > > > Regardless of whether he has or not - what's your point? The ASF has historically not denied the entry of new projects just because their domain intersects with another project's.
> > > > >
> > > > > >
> > > > > > On Thu, Jun 18, 2015 at 10:17 AM, Gavin Li wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I want to propose project Pistachio to enter Apache Incubator.
> > > > > > >
> > > > > > > Below please find the proposal.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Gavin Li
> > > > > > >
> > > > > > > = Pistachio =
> > > > > > >
> > > > > > > == Abstract ==
> > > > > > >
> > > > > > > Pistachio is a fault-tolerant low latency distributed storage system which enables simple embedding the computation to the storage layer to achieve best data locality. It evolves from Yahoo’s global user profile storage system.
> > > > > > >
> > > > > > > == Proposal ==
> > > > > > >
> > > > > > > Pistachio is a distributed key value store system with fault tolerance and consistency guarantee. It supports multiple local storage engine including in-memory, kyoto cabinet, rocks DB etc.
> > > > > > > Pistachio is being used as the user profile storage for massive scale global ads products in Yahoo storing 10+ billion user profiles. The performance and reliability has been well proven on production.
> > > > > > >
> > > > > > > Pistachio can easily embed computation to the storage layer to achieve the best data locality to improve the computation performance significantly which is an innovative model comparing with the normal ways where the storage and compute are independent to each other.
> > > > > > >
> > > > > > > == Background ==
> > > > > > >
> > > > > > > Pistachio is originally designed and optimized for Yahoo’s large scale global open RTB(real-time bidding) use cases where latency is critical(the whole request needs to be finished within 100ms including network round trips). It stores 10+ billion user profiles in 8 data centers.
> > > > > > >
> > > > > > > Then because of the great performance and the flexibility of local storage choices, we evolved it to do distributed compute. Rich call back interfaces are added to supports easy compute directly on top of the storage system local to the data partition. This model is totally different from the traditional distributed computation model where the storage and compute are separated and independent. In the new model we found data locality can be improved significantly and lots of data access round trips can be reduced in computation, and the performance can be improved significantly.
> > > > > > >
> > > > > > > It was publicly announced in April 2015 and currently being hosted in Github.
> > > > > > >
> > > > > > > == Rationale ==
> > > > > > >
> > > > > > > As a key value store system Pistachio is unique in terms of low latency access with fault tolerance and consistency guarantee. The reliability, scalability, fault tolerance and performance has been well proven in global large scale revenue supporting production system in Yahoo.
> > > > > > >
> > > > > > > As a distributed computation system, it’s an innovative model where the compute layer is introduced on top of the storage layer natively and naturally to optimize the data locality of computation.
> > > > > > >
> > > > > > > Operating the project in “apache way” greatly aligns with the long-term vision of this project and can greatly help the development of the community.
> > > > > > >
> > > > > > > == Current Status ==
> > > > > > >
> > > > > > > Pistachio was open-sourced and announced in April 2015 and currently being hosted in Github, it was mainly being developed by the team from Yahoo and already attracted lots of external developers (20+ watches and forks on github).
> > > > > > >
> > > > > > > == Meritocracy ==
> > > > > > >
> > > > > > > We plan to build an environment following the Apache meritocracy principles. Many companies including Linkedin, GF securities, Microsoft and open source communities like deeplearning4j have already expressed interests or accepted the invitations to participate in this project.
> > > > > > >
> > > > > > > == Community ==
> > > > > > >
> > > > > > > Since the announcement of Pistachio we received lots of interests. And the concept of embedding computation to storage also got lots of recognitions. We also started to work with other communities like deeplearning4j to build more application use cases with Pistachio. We believe the community will grow fast.
> > > > > > >
> > > > > > > == Core Developers ==
> > > > > > >
> > > > > > > This project is created by Gavin Li. Core developers are currently mainly in Yahoo.
> > > > > > >
> > > > > > > == Alignment ==
> > > > > > >
> > > > > > > Pistachio depends on many Apache projects and dependencies including Kafka, Helix, Zookeeper, Curator, Apache Commons, etc.
> > > > > > >
> > > > > > > == Known Risks ==
> > > > > > >
> > > > > > > === Orphaned Products ===
> > > > > > >
> > > > > > > The risk of Pistachio being orphaned is small because Yahoo heavily invested in this system. It’s the internal storage standard for Yahoo’s global ads products and still being expanded. Migration cost from this project is very high. We are also working with external communities like deeplearning4j and other companies to expand the applications.
> > > > > > >
> > > > > > > === Inexperience with Open Source ===
> > > > > > >
> > > > > > > Core developers are experienced open source contributors in many projects including Druid, Spark, Storm, etc. Pistachio committers will be guided by the mentors with strong Apache open source project backgrounds.
> > > > > > >
> > > > > > > === Homogeneous Developers ===
> > > > > > >
> > > > > > > The initial committers include developers from several institutions including Microsoft, GF Securities, Linkedin and Yahoo.
> > > > > > >
> > > > > > > === Reliance on Salaried Developers ===
> > > > > > >
> > > > > > > We work on Pistachio on both salaried time and after hours. Many developers from other institutions already accepted the invitation to volunteer working on Pistachio.
> > > > > > >
> > > > > > > === Relationships with Other Apache Products ===
> > > > > > >
> > > > > > > As mentioned earlier, Pistachio depends on apache kafka, helix, zookeeper, curator, etc.
> > > > > > >
> > > > > > > === An Excessive Fascination with the Apache Brand ===
> > > > > > >
> > > > > > > Generating publicity is not the purpose of this proposal. We mainly want to join the ASF in order to increase our contacts and visibility in the open source world to attract great developers.
> > > > > > >
> > > > > > > == Document ==
> > > > > > >
> > > > > > > Current documentation can be found here:
> > > > > > > https://github.com/yahoo/Pistachio.
> > > > > > >
> > > > > > > == Initial source ==
> > > > > > >
> > > > > > > Initial source can be found here in the Github repo:
> > > > > > > https://github.com/yahoo/Pistachio.
> > > > > > >
> > > > > > > == External dependencies ==
> > > > > > >
> > > > > > > To the best of our knowledge, here is the list of dependencies:
> > > > > > > Rocks DB
> > > > > > > ICU4j
> > > > > > > Apache Curator
> > > > > > > netty
> > > > > > > google http client
> > > > > > > codahale.metrics
> > > > > > > apache helix
> > > > > > > apache zookeeper
> > > > > > > apache commons
> > > > > > > apache thrift
> > > > > > > apache kafka
> > > > > > > kyoto cabinet (GNU GPL)
> > > > > > > google protocol buffer
> > > > > > > kryo
> > > > > > > slf4j
> > > > > > >
> > > > > > > To the best of our knowledge, except kyoto cabinet others are all distributed under Apache compatible licenses:
> > > > > > > BSD
> > > > > > > ICU
> > > > > > > Apache License 2.0
> > > > > > > MIT
> > > > > > >
> > > > > > > Kyoto cabinet is under GNU GPL, but it is not a hard necessary dependency to Pistachio, it’s an optional pluggable storage engine. It’s designed in the way that it’s totally pluggable and very loosely coupled. We can easily remove it in graduation.
> > > > > > >
> > > > > > > == Required Resources ==
> > > > > > >
> > > > > > > Mailing Lists
> > > > > > >
> > > > > > > pistachio-user
> > > > > > > pistachio-dev
> > > > > > > pistachio-commits
> > > > > > > pistachio-private (for private PMC discussions)
> > > > > > >
> > > > > > > Git
> > > > > > >
> > > > > > > The Pistachio team prefers Git for source version control: git://git.apache.org/pistachio
> > > > > > >
> > > > > > > Issue Tracking
> > > > > > >
> > > > > > > JIRA Pistachio (PISTACHIO)
> > > > > > >
> > > > > > > Other Resources
> > > > > > >
> > > > > > > Jenkins continuous integration testing
> > > > > > >
> > > > > > > == Initial Committers ==
> > > > > > >
> > > > > > > Gavin Li
> > > > > > > Lie Yang
> > > > > > > Jay Kim
> > > > > > > Flavio Junqueira
> > > > > > > Chihong Liang
> > > > > > > Yong Liu
> > > > > > > Shengwu Yang
> > > > > > >
> > > > > > > == Affiliations ==
> > > > > > >
> > > > > > > Gavin Li - Yahoo
> > > > > > > Flavio Junqueira - Microsoft
> > > > > > > Chihong Liang - GF securities
> > > > > > > Yong Liu - Yingmi Asset Management Corp.
> > > > > > > Lie Yang - Yahoo
> > > > > > > Jay Kim - Yahoo
> > > > > > > Shengwu Yang - Linkedin China
> > > > > > >
> > > > > > > == Sponsors ==
> > > > > > >
> > > > > > > === Champion ===
> > > > > > >
> > > > > > > Flavio Junqueira
> > > > > > >
> > > > > > > === Nominated Mentors ===
> > > > > > >
> > > > > > > === Sponsoring Entity ===
> > > > > > >
> > > > > > > The Apache Incubator
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > > - Andy
> > > > > >
> > > > > > Problems worthy of attack prove their worth by hitting back.
> > > > > > - Piet Hein (via Tom White)
> > > > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
> > > >
> >
> > --
> > Best regards,
> >
> > - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
> >
>

--001a1136cbd8cb5e800519ad0f23--