incubator-general mailing list archives

From Gavin Li <lyo.ga...@gmail.com>
Subject Re: [PROPOSAL]Pistachio
Date Mon, 29 Jun 2015 18:53:57 GMT
Hi Andrew,

I agree with you. I've updated the proposal to include a bit more
explanation of how Pistachio differs from Hadoop.

Pursuing novelty for its own sake was never our goal. Rather, I believe that
even for the same problem, different design and implementation ideas can make
a big difference. I think that's why there are many "internal competitions" at
the ASF. Having looked at other systems such as Ignite and Geode, I believe
Pistachio is still quite different in design and implementation when
solving common problems such as in-memory distributed storage and
co-locating computation and data.

Thanks,
Gavin Li

On Fri, Jun 26, 2015 at 12:07 PM, Andrew Purtell <apurtell@apache.org>
wrote:

> Thanks Gavin.
>
> Please let me suggest that novelty is not a requirement for incubation, and
> a proposal doesn't need to make claims of novelty to be accepted.
>
> Should the proposal be accepted for incubation, you may find your new
> neighbors at Apache can do X where you weren't aware of it. It will be
> totally up to the new podling if you want to survey the landscape when
> figuring out how to differentiate, but I do recommend it; it may help you
> crystallize a community around a real difference and advantage provided by
> Pistachio.
>
>
> On Mon, Jun 22, 2015 at 7:54 PM, Gavin Li <lyo.gavin@gmail.com> wrote:
>
> > Hi Andrew,
> >
> > As we describe in more detail at
> > http://yahooeng.tumblr.com/post/116291838351/pistachio-co-locate-the-data-and-compute-for
> > , a very common problem we saw in Hadoop use cases is that we often need
> > to persist the result of one MapReduce job to HDFS, and then the next day
> > process the new data together with that previous result. Usually the most
> > expensive part is the shuffle, where we need to join the previous data and
> > the new data. It is so expensive because HDFS doesn't store the data in a
> > partitioned way, so the data has to be transferred again and again in the
> > shuffle phase. In Pistachio, by contrast, we do the computation directly on
> > top of the partitioned storage layer, so the previous result is always
> > stored in a partitioned way and shuffling can be avoided. This avoids
> > expensive I/O and round trips, so much better performance can be achieved.
> >
> > The other difference is that in Pistachio we can do computation on top of
> > in-memory storage with data replication. Unlike the in-memory
> > computation in Spark, the storage itself can be in-memory here.
> >
> > Please let me know if I'm not clear enough.
> >
> > Thanks,
> > Gavin Li
> >
> > On Mon, Jun 22, 2015 at 7:53 PM, Andrew Purtell <apurtell@apache.org>
> > wrote:
> >
> > > It was a simple question, and not meant to suggest anything one way or
> > > the other regarding my opinion of this proposal.
> > >
> > > On Monday, June 22, 2015, John D. Ament <johndament@apache.org> wrote:
> > >
> > > > On Mon, Jun 22, 2015 at 10:26 PM Andrew Purtell <apurtell@apache.org>
> > > > wrote:
> > > >
> > > > > > Pistachio can easily embed computation to the storage layer to
> > > > > > achieve the best data locality to improve the computation
> > > > > > performance significantly which is an innovative model comparing
> > > > > > with the normal ways where the storage and compute are
> > > > > > independent to each other.
> > > > >
> > > > > Have you heard of something called Hadoop?
> > > > >
> > > >
> > > > Regardless of whether he has or not - what's your point? The ASF has
> > > > historically not denied the entry of new projects just because their
> > > > domain intersects with another project's.
> > > >
> > > >
> > > > >
> > > > >
> > > > > On Thu, Jun 18, 2015 at 10:17 AM, Gavin Li <lyo.gavin@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I want to propose project Pistachio to enter Apache Incubator.
> > > > > >
> > > > > > Below please find the proposal.
> > > > > >
> > > > > > Thanks,
> > > > > > Gavin Li
> > > > > >
> > > > > >
> > > > > >
> > > > > > = Pistachio =
> > > > > >
> > > > > > == Abstract ==
> > > > > >
> > > > > > Pistachio is a fault-tolerant, low-latency distributed storage
> > > > > > system that makes it simple to embed computation in the storage
> > > > > > layer to achieve the best data locality. It evolved from Yahoo’s
> > > > > > global user profile storage system.
> > > > > >
> > > > > > == Proposal ==
> > > > > >
> > > > > > Pistachio is a distributed key-value store with fault tolerance
> > > > > > and consistency guarantees. It supports multiple local storage
> > > > > > engines, including in-memory, Kyoto Cabinet, RocksDB, etc.
> > > > > > Pistachio is used as the user profile storage for massive-scale
> > > > > > global ads products at Yahoo, storing 10+ billion user profiles.
> > > > > > Its performance and reliability have been well proven in
> > > > > > production.
> > > > > >
> > > > > > Pistachio can easily embed computation to the storage layer to
> > > > > > achieve the best data locality to improve the computation
> > > > > > performance significantly which is an innovative model comparing
> > > > > > with the normal ways where the storage and compute are
> > > > > > independent to each other.
> > > > > >
> > > > > > == Background ==
> > > > > >
> > > > > > Pistachio was originally designed and optimized for Yahoo’s
> > > > > > large-scale global open RTB (real-time bidding) use cases, where
> > > > > > latency is critical (the whole request needs to finish within
> > > > > > 100 ms, including network round trips). It stores 10+ billion user
> > > > > > profiles in 8 data centers.
> > > > > >
> > > > > > Then, because of its performance and the flexibility of its local
> > > > > > storage choices, we evolved it to do distributed computation. Rich
> > > > > > callback interfaces were added to support computation directly on
> > > > > > top of the storage system, local to the data partition. This model
> > > > > > is totally different from the traditional distributed computation
> > > > > > model, where storage and compute are separate and independent. In
> > > > > > the new model we found that data locality improves significantly
> > > > > > and many data access round trips are eliminated during
> > > > > > computation, so performance improves significantly.
> > > > > >
> > > > > > It was publicly announced in April 2015 and is currently hosted on
> > > > > > GitHub.
> > > > > >
> > > > > > == Rationale ==
> > > > > >
> > > > > > As a key-value store, Pistachio is unique in terms of low-latency
> > > > > > access with fault tolerance and consistency guarantees. Its
> > > > > > reliability, scalability, fault tolerance and performance have
> > > > > > been well proven in a global, large-scale, revenue-supporting
> > > > > > production system at Yahoo.
> > > > > >
> > > > > > As a distributed computation system, it’s an innovative model
> > > > > > where the compute layer is introduced on top of the storage layer
> > > > > > natively and naturally to optimize the data locality of
> > > > > > computation.
> > > > > >
> > > > > > Operating the project in the “Apache way” aligns well with the
> > > > > > long-term vision of this project and can greatly help the
> > > > > > development of the community.
> > > > > >
> > > > > > == Current Status ==
> > > > > >
> > > > > > Pistachio was open-sourced and announced in April 2015 and is
> > > > > > currently hosted on GitHub. It has mainly been developed by the
> > > > > > team from Yahoo and has already attracted lots of external
> > > > > > developers (20+ watches and forks on GitHub).
> > > > > >
> > > > > > == Meritocracy ==
> > > > > >
> > > > > > We plan to build an environment following the Apache meritocracy
> > > > > > principles. Many companies, including LinkedIn, GF Securities and
> > > > > > Microsoft, and open source communities such as deeplearning4j
> > > > > > have already expressed interest or accepted invitations to
> > > > > > participate in this project.
> > > > > >
> > > > > > == Community ==
> > > > > >
> > > > > > Since the announcement of Pistachio we have received a lot of
> > > > > > interest, and the concept of embedding computation in storage has
> > > > > > also received a lot of recognition. We have also started to work
> > > > > > with other communities such as deeplearning4j to build more
> > > > > > application use cases with Pistachio. We believe the community
> > > > > > will grow fast.
> > > > > >
> > > > > > == Core Developers ==
> > > > > >
> > > > > > This project was created by Gavin Li. Core developers are
> > > > > > currently mainly at Yahoo.
> > > > > >
> > > > > > == Alignment ==
> > > > > >
> > > > > > Pistachio depends on many Apache projects and dependencies,
> > > > > > including Kafka, Helix, ZooKeeper, Curator, Apache Commons, etc.
> > > > > >
> > > > > > == Known Risks ==
> > > > > >
> > > > > > === Orphaned Products ===
> > > > > >
> > > > > > The risk of Pistachio being orphaned is small because Yahoo has
> > > > > > invested heavily in this system. It’s the internal storage
> > > > > > standard for Yahoo’s global ads products and is still being
> > > > > > expanded. The cost of migrating away from this project is very
> > > > > > high. We are also working with external communities such as
> > > > > > deeplearning4j and with other companies to expand the
> > > > > > applications.
> > > > > >
> > > > > > === Inexperience with Open Source ===
> > > > > >
> > > > > > Core developers are experienced open source contributors to many
> > > > > > projects, including Druid, Spark, Storm, etc. Pistachio committers
> > > > > > will be guided by mentors with strong Apache open source project
> > > > > > backgrounds.
> > > > > >
> > > > > > === Homogeneous Developers ===
> > > > > >
> > > > > > The initial committers include developers from several
> > > > > > institutions, including Microsoft, GF Securities, LinkedIn and
> > > > > > Yahoo.
> > > > > >
> > > > > > === Reliance on Salaried Developers ===
> > > > > >
> > > > > > We work on Pistachio both on salaried time and after hours. Many
> > > > > > developers from other institutions have already accepted the
> > > > > > invitation to volunteer on Pistachio.
> > > > > >
> > > > > > === Relationships with Other Apache Products ===
> > > > > >
> > > > > > As mentioned earlier, Pistachio depends on Apache Kafka, Helix,
> > > > > > ZooKeeper, Curator, etc.
> > > > > >
> > > > > > === An Excessive Fascination with the Apache Brand ===
> > > > > >
> > > > > > Generating publicity is not the purpose of this proposal. We
> > > > > > mainly want to join the ASF in order to increase our contacts and
> > > > > > visibility in the open source world to attract great developers.
> > > > > >
> > > > > > == Document ==
> > > > > >
> > > > > > Current documentation can be found here:
> > > > > > https://github.com/yahoo/Pistachio.
> > > > > >
> > > > > > == Initial source ==
> > > > > >
> > > > > > Initial source can be found here in the Github repo:
> > > > > > https://github.com/yahoo/Pistachio.
> > > > > >
> > > > > > == External dependencies ==
> > > > > >
> > > > > > To the best of our knowledge, here is the list of dependencies:
> > > > > > Rocks DB
> > > > > > ICU4j
> > > > > > Apache Curator
> > > > > > netty
> > > > > > google http client
> > > > > > codahale.metrics
> > > > > > apache helix
> > > > > > apache zookeeper
> > > > > > apache commons
> > > > > > apache thrift
> > > > > > apache kafka
> > > > > > kyoto cabinet (GNU GPL)
> > > > > > google protocol buffer
> > > > > > kryo
> > > > > > slf4j
> > > > > >
> > > > > > To the best of our knowledge, all of these except Kyoto Cabinet
> > > > > > are distributed under Apache-compatible licenses:
> > > > > > BSD
> > > > > > ICU
> > > > > > Apache License 2.0
> > > > > > MIT
> > > > > >
> > > > > > Kyoto Cabinet is under the GNU GPL, but it is not a hard
> > > > > > dependency of Pistachio; it’s an optional pluggable storage
> > > > > > engine. It’s designed to be fully pluggable and very loosely
> > > > > > coupled. We can easily remove it by graduation.
> > > > > >
> > > > > > == Required Resources ==
> > > > > >
> > > > > > Mailing Lists
> > > > > >
> > > > > > pistachio-user
> > > > > > pistachio-dev
> > > > > > pistachio-commits
> > > > > > pistachio-private (for private PMC discussions)
> > > > > >
> > > > > > Git
> > > > > >
> > > > > > The Pistachio team prefers Git for source version control:
> > > > > > git://git.apache.org/pistachio
> > > > > >
> > > > > > Issue Tracking
> > > > > >
> > > > > > JIRA Pistachio (PISTACHIO)
> > > > > >
> > > > > > Other Resources
> > > > > >
> > > > > > Jenkins continuous integration testing
> > > > > >
> > > > > > == Initial Committers ==
> > > > > >
> > > > > > Gavin Li <lyo.gavin at gmail dot com>
> > > > > > Lie Yang <lyang at yahoo-inc dot com>
> > > > > > Jay Kim <pitecus at yahoo-inc dot com>
> > > > > > Flavio Junqueira <fpj at apache dot org>
> > > > > > Chihong Liang <chihong.liang at gmail dot com>
> > > > > > Yong Liu <ly7110 at gmail dot com>
> > > > > > Shengwu Yang <yangshengwu at gmail dot com>
> > > > > >
> > > > > > == Affiliations ==
> > > > > >
> > > > > > Gavin Li - Yahoo
> > > > > > Flavio Junqueira - Microsoft
> > > > > > Chihong Liang - GF Securities
> > > > > > Yong Liu - Yingmi Asset Management Corp.
> > > > > > Lie Yang - Yahoo
> > > > > > Jay Kim - Yahoo
> > > > > > Shengwu Yang - Linkedin China
> > > > > >
> > > > > > == Sponsors ==
> > > > > >
> > > > > > === Champion ===
> > > > > >
> > > > > > Flavio Junqueira <fpj at apache dot org>
> > > > > >
> > > > > > === Nominated Mentors ===
> > > > > >
> > > > > > === Sponsoring Entity ===
> > > > > >
> > > > > > The Apache Incubator
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > >    - Andy
> > > > >
> > > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > > > Hein (via Tom White)
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > Hein (via Tom White)
> > >
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
