incubator-general mailing list archives

From Gavin Li <lyo.ga...@gmail.com>
Subject [PROPOSAL]Pistachio
Date Thu, 18 Jun 2015 17:17:22 GMT
Hi,

I would like to propose the Pistachio project for entry into the Apache
Incubator.

Please find the proposal below.

Thanks,
Gavin Li



= Pistachio =

== Abstract ==

Pistachio is a fault-tolerant, low-latency distributed storage system that
makes it simple to embed computation in the storage layer to achieve the
best data locality. It evolved from Yahoo’s global user profile storage
system.

== Proposal ==

Pistachio is a distributed key-value store with fault tolerance and
consistency guarantees. It supports multiple local storage engines,
including in-memory, Kyoto Cabinet and RocksDB. Pistachio is used as the
user profile store for massive-scale global ads products at Yahoo, storing
10+ billion user profiles. Its performance and reliability have been well
proven in production.

Pistachio can easily embed computation in the storage layer to achieve the
best data locality, which improves computation performance significantly.
This is an innovative model compared with the usual approach, in which
storage and compute are independent of each other.

== Background ==

Pistachio was originally designed and optimized for Yahoo’s large-scale
global open RTB (real-time bidding) use cases, where latency is critical
(the whole request needs to finish within 100 ms, including network round
trips). It stores 10+ billion user profiles in 8 data centers.

Then, because of its strong performance and the flexibility of its local
storage choices, we evolved it to do distributed compute. Rich callback
interfaces were added to support easy compute directly on top of the storage
system, local to the data partition. This model is quite different from the
traditional distributed computation model, where storage and compute are
separated and independent. In the new model we found that data locality
improves significantly and many data access round trips are eliminated
during computation, so performance improves significantly.
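
To make the round-trip argument concrete, here is a minimal Python sketch.
It is not Pistachio's actual API: the Partition and Cluster classes, the
process() callback hook and the round_trips counter are hypothetical names
invented for illustration, and the "network" is only simulated by counting
calls. It compares a client-side read-modify-write against a callback that
runs next to the data on the owning partition.

```python
import zlib
from typing import Callable, Dict, List, Optional


class Partition:
    """One data partition with an in-memory local store (hypothetical)."""

    def __init__(self) -> None:
        self.store: Dict[str, int] = {}
        self.round_trips = 0  # simulated client <-> storage network hops

    # Remote-style API: every call costs one simulated round trip.
    def get(self, key: str) -> Optional[int]:
        self.round_trips += 1
        return self.store.get(key)

    def put(self, key: str, value: int) -> None:
        self.round_trips += 1
        self.store[key] = value

    # Embedded-compute API (assumed shape): the callback runs local to the
    # data, so a read-modify-write costs a single round trip.
    def process(self, key: str, fn: Callable[[Optional[int]], int]) -> int:
        self.round_trips += 1
        new_value = fn(self.store.get(key))
        self.store[key] = new_value
        return new_value


class Cluster:
    """Routes each key to exactly one partition by a stable hash."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[Partition] = [
            Partition() for _ in range(num_partitions)
        ]

    def partition_for(self, key: str) -> Partition:
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def total_round_trips(self) -> int:
        return sum(p.round_trips for p in self.partitions)


def increment_remote(cluster: Cluster, key: str) -> int:
    """Compute on the client: one trip to read, one trip to write."""
    partition = cluster.partition_for(key)
    current = partition.get(key) or 0
    partition.put(key, current + 1)
    return current + 1


def increment_local(cluster: Cluster, key: str) -> int:
    """Ship the computation to the partition that owns the key."""
    partition = cluster.partition_for(key)
    return partition.process(key, lambda old: (old or 0) + 1)
```

In this toy model, ten increments of one key cost 20 simulated round trips
with the remote-style calls and 10 with the embedded callback. A real system
would also have to handle replication, failures and concurrent writers,
which the sketch ignores.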

It was publicly announced in April 2015 and is currently hosted on
GitHub.

== Rationale ==

As a key-value store, Pistachio is unique in offering low-latency access
with fault tolerance and consistency guarantees. Its reliability,
scalability, fault tolerance and performance have been well proven in a
global, large-scale, revenue-supporting production system at Yahoo.

As a distributed computation system, it’s an innovative model where the
compute layer is introduced on top of the storage layer natively and
naturally to optimize the data locality of computation.

Operating the project in the “Apache Way” aligns well with the long-term
vision of this project and can greatly help the development of its
community.

== Current Status ==

Pistachio was open-sourced and announced in April 2015 and is currently
hosted on GitHub. It has mainly been developed by the team at Yahoo and has
already attracted external developers (20+ watches and forks on
GitHub).

== Meritocracy ==

We plan to build an environment that follows the Apache meritocracy
principles. Many companies, including LinkedIn, GF Securities and Microsoft,
and open source communities such as deeplearning4j have already expressed
interest or accepted invitations to participate in this project.

== Community ==

Since the announcement of Pistachio we have received a lot of interest, and
the concept of embedding computation in storage has also gained recognition.
We have also started to work with other communities, such as deeplearning4j,
to build more application use cases with Pistachio. We believe the community
will grow fast.

== Core Developers ==

This project was created by Gavin Li. The core developers are currently
mainly at Yahoo.

== Alignment ==

Pistachio depends on many Apache projects, including Kafka, Helix,
ZooKeeper, Curator and Commons.

== Known Risks ==

=== Orphaned Products ===

The risk of Pistachio being orphaned is small because Yahoo has invested
heavily in this system. It is the internal storage standard for Yahoo’s
global ads products and is still being expanded. The cost of migrating away
from this project is very high. We are also working with external
communities such as deeplearning4j and with other companies to expand the
applications.

=== Inexperience with Open Source ===

Core developers are experienced open source contributors to many projects,
including Druid, Spark and Storm. Pistachio committers will be guided by
mentors with strong Apache open source project backgrounds.

=== Homogeneous Developers ===

The initial committers include developers from several institutions,
including Microsoft, GF Securities, LinkedIn and Yahoo.

=== Reliance on Salaried Developers ===

We work on Pistachio both on salaried time and after hours. Many developers
from other institutions have already accepted the invitation to volunteer
to work on Pistachio.

=== Relationships with Other Apache Products ===

As mentioned earlier, Pistachio depends on Apache Kafka, Helix, ZooKeeper,
Curator, etc.

=== An Excessive Fascination with the Apache Brand ===

Generating publicity is not the purpose of this proposal. We mainly want to
join the ASF in order to increase our contacts and visibility in the open
source world to attract great developers.

== Documentation ==

Current documentation can be found here: https://github.com/yahoo/Pistachio.

== Initial source ==

The initial source can be found in the GitHub repo:
https://github.com/yahoo/Pistachio.

== External dependencies ==

To the best of our knowledge, here is the list of dependencies:
RocksDB
ICU4J
Apache Curator
Netty
Google HTTP Client
codahale.metrics
Apache Helix
Apache ZooKeeper
Apache Commons
Apache Thrift
Apache Kafka
Kyoto Cabinet (GNU GPL)
Google Protocol Buffers
Kryo
SLF4J

To the best of our knowledge, all of these except Kyoto Cabinet are
distributed under Apache-compatible licenses:
BSD
ICU
Apache License 2.0
MIT

Kyoto Cabinet is under the GNU GPL, but it is not a hard dependency of
Pistachio; it is an optional, pluggable storage engine. It is designed to be
fully pluggable and very loosely coupled, so we can easily remove it before
graduation.

== Required Resources ==

Mailing Lists

pistachio-user
pistachio-dev
pistachio-commits
pistachio-private (for private PMC discussions)

Git

The Pistachio team prefers Git for source version control:
git://git.apache.org/pistachio

Issue Tracking

JIRA Pistachio (PISTACHIO)

Other Resources

Jenkins continuous integration testing

== Initial Committers ==

Gavin Li <lyo.gavin at gmail dot com>
Lie Yang <lyang at yahoo-inc dot com>
Jay Kim <pitecus at yahoo-inc dot com>
Flavio Junqueira <fpj at apache dot org>
Chihong Liang <chihong.liang at gmail dot com>
Yong Liu <ly7110 at gmail dot com>
Shengwu Yang <yangshengwu at gmail dot com>

== Affiliations ==

Gavin Li - Yahoo
Flavio Junqueira - Microsoft
Chihong Liang - GF Securities
Yong Liu - Yingmi Asset Management Corp.
Lie Yang - Yahoo
Jay Kim - Yahoo
Shengwu Yang - LinkedIn China

== Sponsors ==

=== Champion ===

Flavio Junqueira <fpj at apache dot org>

=== Nominated Mentors ===

=== Sponsoring Entity ===

The Apache Incubator
