incubator-general mailing list archives

From Henry Saputra <henry.sapu...@gmail.com>
Subject Re: [DISCUSS] Proposing Griffin for Apache incubator
Date Thu, 24 Nov 2016 01:03:43 GMT
Hi John,

We have added this comment in the proposal:

"
The initial committers are employees of eBay Inc.
"

- Henry

On Wed, Nov 23, 2016 at 4:50 PM, John D. Ament <johndament@apache.org>
wrote:

> Henry,
>
> Can you add initial committer affiliations to the proposal?
>
> John
>
> On Wed, Nov 23, 2016 at 6:30 PM Henry Saputra <henry.saputra@gmail.com>
> wrote:
>
> > Hi All,
> >
> > As the champion for Griffin, I would like to open a discussion on
> > bringing the project into the Apache Incubator as a podling.
> >
> > Here is the direct quote from the abstract:
> >
> > "
> > Griffin is a Data Quality Service platform built on Apache Hadoop and
> > Apache Spark. It provides a framework process for defining data
> > quality model, executing data quality measurement, automating data
> > profiling and validation, as well as a unified data quality
> > visualization across multiple data systems. It tries to address the
> > data quality challenges in big data and streaming context.
> > "
> >
> > Here is the link to the proposal:
> > https://wiki.apache.org/incubator/GriffinProposal
> >
> > I have copied the proposal below for easy access
> >
> >
> > Thanks,
> >
> > - Henry
> >
> >
> > Griffin Proposal
> >
> > Abstract
> >
> > Griffin is a Data Quality Service platform built on Apache Hadoop and
> > Apache Spark. It provides a framework process for defining data
> > quality model, executing data quality measurement, automating data
> > profiling and validation, as well as a unified data quality
> > visualization across multiple data systems. It tries to address the
> > data quality challenges in big data and streaming context.
> >
> > Proposal
> >
> > Griffin is an open source Data Quality solution for distributed data
> > systems at any scale, in both streaming and batch data contexts. When
> > people use open source products (e.g. Apache Hadoop, Apache Spark,
> > Apache Kafka, Apache Storm), they need a data quality service
> > to build confidence in the quality of the data processed by those
> > platforms. Griffin creates a unified process to define and construct
> > data quality measurement pipelines across multiple data systems to
> > provide:
> >
> > Automatic quality validation of the data
> > Data profiling and anomaly detection
> > Data quality lineage from upstream to downstream data systems.
> > Data quality health monitoring visualization
> > Shared infrastructure resource management
> >
> > Overview of Griffin
> >
> > Griffin has been deployed in production at eBay, serving major data
> > systems. It takes a platform approach, providing generic features that
> > solve common data quality validation pain points. First, a user
> > registers the data asset to be checked for data quality. The
> > data asset can be batch data in an RDBMS (e.g. Teradata) or an Apache
> > Hadoop system, or near real-time streaming data from Apache Kafka,
> > Apache Storm and other real-time data platforms. Second, the user
> > creates a data quality model to define the data quality rules and
> > metadata. Third, the model or rule is executed automatically (by the
> > model engine) to produce sample data quality validation results within
> > a few seconds for streaming data. Finally, the user analyzes the data
> > quality results through the built-in visualization tool and takes action.
> >
> > Griffin includes:
> >
> > Data Quality Model Engine
> >
> > Griffin is a model-driven solution: users can choose among various data
> > quality dimensions to run data quality validation based on a selected
> > target data set or source data set (the latter serving as the golden
> > reference data). It has a corresponding back-end library supporting the
> > following measurements:
> >
> > Accuracy - Does the data reflect real-world objects or a verifiable
> > source?
> > Completeness - Is all necessary data present?
> > Validity - Are all data values within the data domains specified by the
> > business?
> > Timeliness - Is the data available at the time needed?
> > Anomaly detection - Pre-built algorithm functions for the
> > identification of items, events or observations which do not conform
> > to an expected pattern or other items in a dataset
> > Data Profiling - Apply statistical analysis and assessment of data
> > values within a dataset for consistency, uniqueness and logic.
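
As a rough illustration of the first three dimensions in the list above, the sketch below computes accuracy, completeness and validity over small in-memory record sets. It is not Griffin code: the function names, the record layout (dicts keyed by `id`), the sample data and the age rule are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A record is a flat mapping from field name to value; None marks a missing value.
Record = Dict[str, Optional[object]]


def accuracy(target: Iterable[Record], source: Iterable[Record], key: str) -> float:
    """Share of target records identical to the golden source record with the same key."""
    golden = {rec[key]: rec for rec in source}
    rows: List[Record] = list(target)
    if not rows:
        return 1.0  # assumption: an empty data set counts as fully accurate
    matched = sum(1 for rec in rows if golden.get(rec[key]) == rec)
    return matched / len(rows)


def completeness(records: Iterable[Record], required: List[str]) -> float:
    """Share of records in which every required field is present and not None."""
    rows = list(records)
    if not rows:
        return 1.0  # assumption: an empty data set counts as fully complete
    complete = sum(1 for rec in rows if all(rec.get(f) is not None for f in required))
    return complete / len(rows)


def validity(records: Iterable[Record], field: str,
             is_valid: Callable[[object], bool]) -> float:
    """Share of records whose value for `field` passes the business rule `is_valid`."""
    rows = list(records)
    if not rows:
        return 1.0  # assumption: an empty data set counts as fully valid
    return sum(1 for rec in rows if is_valid(rec.get(field))) / len(rows)


def is_plausible_age(value: object) -> bool:
    """Hypothetical business rule: age must be an integer between 0 and 120."""
    return isinstance(value, int) and 0 <= value <= 120


# Hypothetical sample data: SOURCE is the golden reference, TARGET is under test.
SOURCE: List[Record] = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "DE", "age": 51},
    {"id": 3, "country": "CN", "age": 27},
]
TARGET: List[Record] = [
    {"id": 1, "country": "US", "age": 34},    # matches the golden record
    {"id": 2, "country": "DE", "age": None},  # missing value
    {"id": 3, "country": "CN", "age": -5},    # present but out of domain
    {"id": 4, "country": "FR", "age": 40},    # not in the golden source
]

if __name__ == "__main__":
    print(accuracy(TARGET, SOURCE, "id"))             # 0.25
    print(completeness(TARGET, ["country", "age"]))   # 0.75
    print(validity(TARGET, "age", is_plausible_age))  # 0.5
```

Each function returns a ratio between 0 and 1, so results for different dimensions can be plotted on the same scale. Treating an unmatched target record as inaccurate is a choice made for this sketch only.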
> >
> > Data Collection Layer
> >
> > We support two kinds of data sources, batch data and real time data.
> >
> > For batch mode, we collect data from Apache Hadoop based
> > platforms through various data connectors.
> >
> > For real-time mode, we connect to messaging systems such as Kafka for
> > near real-time analysis.
> >
> > Data Process and Storage Layer
> >
> > For batch analysis, our data quality model computes data quality
> > metrics in our Spark cluster based on data sources in Apache Hadoop.
> >
> > For near real-time analysis, we consume data from a messaging system,
> > then our data quality model computes real-time data quality
> > metrics in our Spark cluster. For data storage, we use a time series
> > database in our back end to fulfill front-end requests.
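
The near real-time path described above (consume messages, compute a metric, store it as a time series) can be sketched without Kafka or Spark. The example below is only a stand-in under stated assumptions: messages are plain `(timestamp, record)` tuples, windows are fixed 60-second tumbling windows, and the "time series" is a sorted list of `(window_start, value)` points. None of these names come from Griffin.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed message shape: (epoch seconds, record); None marks a missing field value.
Message = Tuple[int, Dict[str, Optional[str]]]


def windowed_completeness(messages: Iterable[Message],
                          required: List[str],
                          window_seconds: int = 60) -> List[Tuple[int, float]]:
    """Completeness ratio per tumbling window, returned as sorted time series points."""
    totals: Dict[int, int] = defaultdict(int)
    complete: Dict[int, int] = defaultdict(int)
    for timestamp, record in messages:
        # Align each message to the start of the window that contains it.
        window_start = timestamp - timestamp % window_seconds
        totals[window_start] += 1
        if all(record.get(field) is not None for field in required):
            complete[window_start] += 1
    # One (window_start, ratio) point per window that received any message.
    return [(start, complete[start] / totals[start]) for start in sorted(totals)]


if __name__ == "__main__":
    # Hypothetical stream of five messages spread over three windows.
    stream: List[Message] = [
        (0, {"user": "a", "item": "x"}),
        (30, {"user": "b", "item": None}),
        (61, {"user": "c", "item": "y"}),
        (119, {"user": None, "item": "z"}),
        (120, {"user": "d", "item": "w"}),
    ]
    for point in windowed_completeness(stream, ["user", "item"]):
        print(point)
```

In a real deployment the loop would be replaced by a stream processing job and the returned points would be written to the time series database; this sketch only shows the shape of the computation.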
> >
> > Griffin Service
> >
> > We have RESTful web services that expose all the functionality of
> > Griffin, such as registering data assets, creating data quality models,
> > publishing metrics, retrieving metrics, adding subscriptions, etc.
> > Developers can therefore build their own user interfaces on top of
> > these web services.
> >
> > Background
> >
> > At eBay, when people work with big data in Apache Hadoop (or with
> > streaming data), data quality often becomes a major challenge.
> > Different teams have built customized data quality tools to detect and
> > analyze data quality issues within their own domains. We want
> > to take a platform approach that provides shared infrastructure and
> > generic features to solve common data quality pain points. This would
> > enable us to build trusted data assets.
> >
> > Currently it’s very difficult and costly to validate data quality
> > when big data flows across multiple platforms at eBay (e.g.
> > Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka,
> > MongoDB). Take eBay's real-time personalization platform as an example.
> > Every day we have to validate the data quality status of ~600M records
> > (imagine that we have 150M active users on our website). Data quality
> > often becomes a major challenge in both its streaming and batch pipelines.
> >
> > We have identified three data quality problems at eBay:
> >
> > Lack of an end-to-end, unified view of data quality measurement from
> > multiple data sources to target applications; it usually takes a long
> > time to identify and fix poor data quality.
> > No way to measure data quality in streaming mode; we need a
> > process and tool to visualize data quality insights by
> > registering the dataset to be checked, creating a
> > data quality measurement model, executing the data quality validation
> > job and getting metrics insights on which to act.
> > No shared platform and API service; each team has to provision and
> > manage its own hardware and software infrastructure.
> >
> > Rationale
> >
> > The challenge we face at eBay is that our data volume keeps growing
> > and our system processes keep becoming more complex, while we do
> > not have a unified data quality solution to ensure trusted data
> > sets that give our data consumers confidence in data quality.
> > The key data quality challenges include:
> >
> > Existing commercial data quality solutions cannot address data quality
> > lineage among systems and cannot scale out to support fast-growing data
> > at eBay.
> > Existing eBay domain-specific tools take a long time to identify and
> > fix poor data quality when data flows through multiple systems.
> > Business logic is becoming complex, which requires a much more flexible
> > data quality system.
> > Some data quality issues have business impact on user experience,
> > revenue, efficiency and compliance.
> > Communication overhead around data quality metrics, typically in a big
> > organization, where different teams are involved.
> >
> > The idea of Griffin is to provide Data Quality validation as a
> > Service, to allow data engineers and data consumers to have:
> >
> > Near real-time understanding of the data quality health of your data
> > pipelines with end-to-end monitoring, all in one place.
> > Profiling, detecting and correlating issues and providing
> > recommendations that drive rapid and focused troubleshooting
> > A centralized data quality model management system including rule,
> > metadata, scheduler etc.
> > Native code generation to run everywhere, including Hadoop, Kafka, Spark,
> > etc.
> > One set of tools to build data quality pipelines across all eBay data
> > platforms.
> >
> > Current Status
> >
> > Meritocracy
> >
> > Griffin has been deployed in production at eBay and provides a
> > centralized data quality service for several eBay systems (for
> > example, the real-time personalization platform, the eBay real-time ID
> > linking platform, Hadoop datasets and the Site speed analytics
> > platform). Our aim is
> > to build a diverse developer and user community following the Apache
> > meritocracy model. We will encourage contributions and participation
> > of all types of work, and ensure that contributors are appropriately
> > recognized.
> >
> > Community
> >
> > Currently the project is being developed at eBay, and its community is
> > internal to eBay. Griffin seeks to develop its developer and user
> > communities during incubation. We believe it will grow substantially
> > by becoming an Apache project.
> >
> > Core Developers
> >
> > Griffin is currently being designed and developed by engineers from
> > eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
> > All of these core developers have deep expertise in Apache Hadoop and
> > the Hadoop Ecosystem in general.
> >
> > Alignment
> >
> > The ASF is a natural host for Griffin given that it is already the
> > home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
> > emerging big data products. By their nature, these products need a data
> > quality solution to ensure the quality of the data they process. When
> > people use open source data technology, the big question for them is
> > how to ensure the quality of the data in it. Griffin leverages many
> > Apache open-source products. Griffin was designed to enable real-time
> > insights into data quality validation through shared infrastructure and
> > generic features that solve common data quality pain points.
> >
> > Known Risks
> >
> > Orphaned Products
> >
> > The core developers of the Griffin team work full time on this project.
> > There is no risk of Griffin getting orphaned since at least one large
> > company (eBay) is using it extensively in its production Hadoop and
> > Spark clusters for multiple data systems. For example, four
> > data systems at eBay (the real-time personalization platform, the eBay
> > real-time ID linking platform, Hadoop and the Site speed analytics
> > platform) currently leverage Griffin, with ~600M records validated for
> > data quality status every day, 35 data sets being monitored and 50+
> > data quality models created.
> >
> > As Griffin is designed to connect to many types of data sources, we are
> > very confident that others will use Griffin as a service for ensuring
> > data quality in open source data ecosystems. We plan to extend and
> > diversify this community further through Apache.
> >
> > Inexperience with Open Source
> >
> > Griffin's core engineers are all active users and followers of open
> > source projects. They are already committers and contributors to the
> > Griffin GitHub project. All have worked on source code
> > that has been released under an open source license, and several of
> > them also have experience developing code in an open source
> > environment. Though the core developers do not have Apache open
> > source experience, there are plans to onboard individuals with Apache
> > open source experience onto the project.
> >
> > Homogenous Developers
> >
> > The core developers are from eBay. The Apache Incubation process
> > encourages an open and diverse meritocratic community. Griffin intends
> > to make every possible effort to build a diverse, vibrant and involved
> > community. We are committed to recruiting additional committers from
> > other companies based on their contribution to the project.
> >
> > Reliance on Salaried Developers
> >
> > eBay invested in Griffin as a company-wide data quality service
> > platform and some of its key engineers are working full time on the
> > project. They are all paid by eBay. We look forward to other Apache
> > developers and researchers contributing to the project.
> >
> > Relationships with Other Apache Products
> >
> > Griffin has a strong relationship with, and dependency on, Apache
> > Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
> > Apache Hive. In addition, since there is a growing need for data quality
> > solutions for open source platforms (e.g. Hadoop, Kafka, Spark etc.),
> > being part of Apache’s Incubation community could help foster closer
> > collaboration among these projects as well as others.
> >
> > Documentation
> >
> > Information about Griffin can be found at
> > https://github.com/eBay/griffin
> >
> > Initial Source
> >
> > Griffin has been under development since early 2016 by a team of
> > engineers at eBay Inc. It is currently hosted on GitHub under the
> > Apache License 2.0 at https://github.com/eBay/griffin . Once in
> > incubation we will move the code base to an Apache Git repository.
> >
> > External Dependencies
> >
> > Griffin has the following external dependencies.
> >
> > Basic
> >
> > JDK 1.7+
> > Scala
> > Apache Maven
> > JUnit
> > Log4j
> > Slf4j
> > Apache Commons
> >
> > Hadoop
> >
> > Apache Hadoop
> > Apache HBase
> > Apache Hive
> >
> > DB
> >
> > InfluxData
> >
> > Apache Spark
> >
> > Spark Core Library
> >
> > REST Service
> >
> > Jersey
> > Spring MVC
> >
> > Web frontend
> >
> > AngularJS
> > jQuery
> > Bootstrap
> > RequireJS
> > eCharts
> > Font Awesome
> >
> > Cryptography
> >
> > Currently there's no cryptography in Griffin.
> >
> > Required Resources
> >
> > Mailing List
> >
> > We currently use eBay mailboxes to communicate, but we'd like to move
> > that to ASF-maintained mailing lists.
> >
> > Current mailing list: ebay-griffin-devs@googlegroups.com
> >
> > Proposed ASF maintained lists:
> >
> > private@griffin.incubator.apache.org
> >
> > dev@griffin.incubator.apache.org
> >
> > commits@griffin.incubator.apache.org
> >
> > Subversion Directory
> >
> > Git is the preferred source control system.
> >
> > Issue Tracking
> >
> > JIRA
> >
> > Other Resources
> >
> > The existing code already has unit tests so we will make use of
> > existing Apache continuous testing infrastructure. The resulting load
> > should not be very large.
> >
> > Initial Committers
> >
> > William Guo
> > Alex Lv
> > Vincent Zhao
> > Shawn Sha
> > John Liu
> > Liang Shao
> >
> > Affiliations
> >
> > The initial committers are employees of eBay Inc.
> >
> > Sponsors
> >
> > Champion
> >
> > Henry Saputra (hsaputra@apache.org)
> >
> > Nominated Mentors
> >
> > Kasper Sørensen (kaspersor@apache.org)
> >
> > Uma Maheswara Rao Gangumalla (umamahesh@apache.org)
> >
> > Luciano Resende (luckbr1975@gmail.com)
> >
> > Sponsoring Entity
> >
> > We are requesting the Incubator to sponsor this project.
> >
> >
> >
>
