Mailing-List: contact cvs-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@incubator.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: Apache Wiki <wikidiffs@apache.org>
To: Apache Wiki <wikidiffs@apache.org>
Date: Fri, 18 Nov 2016 21:12:49 -0000
Message-ID: <147950356990.30572.12807797323529397455@moin-vm.apache.org>
Subject: [Incubator Wiki] Update of "GriffinProposal" by alexlv
Auto-Submitted: auto-generated
archived-at: Fri, 18 Nov 2016 21:13:14 -0000

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "GriffinProposal" page has been changed by alexlv:
https://wiki.apache.org/incubator/GriffinProposal?action=diff&rev1=1&rev2=2

   * Data quality health monitoring visualization
   * Shared infrastructure resource management
  =

+ == Overview of Griffin ==
+ Griffin has been deployed in production at eBay, where it serves major data systems. It takes a platform approach, providing generic features that solve common data quality validation pain points. First, users register the data asset they want to check. The data asset can be batch data in an RDBMS (e.g. Teradata) or an Apache Hadoop system, or near real-time streaming data from Apache Kafka, Apache Storm and other real-time data platforms. Second, users create a data quality model that defines the data quality rules and metadata. Third, the model engine executes the model or rule automatically; for streaming data, sample validation results are available within a few seconds. Finally, users analyze the data quality results through the built-in visualization tool and take action.
+
+ Griffin includes:
+
+ '''Data Quality Model Engine'''
+
+ Griffin is a model-driven solution: users choose among various data quality dimensions and run their validation against a selected target data set or a source data set (the golden reference data). A back-end library supports the following measurements:
+  * Accuracy - Does the data reflect real-world objects or a verifiable source?
+  * Completeness - Is all necessary data present?
+  * Validity - Are all data values within the data domains specified by the business?
+  * Timeliness - Is the data available at the time needed?
+  * Anomaly detection - Pre-built algorithm functions that identify items, events or observations which do not conform to an expected pattern or to other items in a dataset
+  * Data Profiling - Statistical analysis and assessment of data values within a dataset for consistency, uniqueness and logic
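
As a rough illustration of three of the dimensions above, the following Python sketch computes accuracy, completeness and validity as simple ratios over in-memory records. It is not Griffin's implementation (which runs on Spark); the function names, the record layout and the choice to score accuracy as "target rows that exactly match the source row with the same key" are all assumptions made for this example.

```python
def accuracy(target, source, key):
    """Fraction of target records identical to the source record with the same key."""
    if not target:
        return 1.0
    source_by_key = {row[key]: row for row in source}
    matched = sum(1 for row in target if source_by_key.get(row[key]) == row)
    return matched / len(target)


def completeness(records, required_fields):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 1.0
    complete = sum(
        1 for row in records
        if all(row.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(records)


def validity(records, field, allowed_values):
    """Fraction of records whose value for `field` lies in the allowed domain."""
    if not records:
        return 1.0
    valid = sum(1 for row in records if row.get(field) in allowed_values)
    return valid / len(records)


# Hypothetical sample data: `source` plays the role of the golden reference.
source = [
    {"id": 1, "country": "US", "email": "a@example.com"},
    {"id": 2, "country": "DE", "email": "b@example.com"},
    {"id": 3, "country": "CN", "email": "c@example.com"},
    {"id": 4, "country": "UK", "email": "d@example.com"},
]
target = [
    {"id": 1, "country": "US", "email": "a@example.com"},
    {"id": 2, "country": "DE", "email": ""},               # missing email
    {"id": 3, "country": "XX", "email": "c@example.com"},  # invalid country
    {"id": 4, "country": "UK", "email": "d@example.com"},
]

print("accuracy:", accuracy(target, source, "id"))
print("completeness:", completeness(target, ["id", "country", "email"]))
print("validity:", validity(target, "country", {"US", "DE", "CN", "UK"}))
```

Each function returns a ratio between 0 and 1, so results from different dimensions can be charted on the same scale.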
+
+ '''Data Collection Layer'''
+
+ We support two kinds of data sources: batch data and real-time data.
+
+ In batch mode, we collect data from Apache Hadoop based platforms through various data connectors.
+
+ In real-time mode, we connect to messaging systems such as Kafka for near real-time analysis.
+
+ '''Data Process and Storage Layer'''
+
+ For batch analysis, our data quality model computes data quality metrics in our Spark cluster from data sources in Apache Hadoop.
+
+ For near real-time analysis, we consume data from the messaging system, and our data quality model then computes real-time data quality metrics in our Spark cluster. For data storage, we use a time series database in the back end to serve front-end requests.
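
The near real-time path described above can be pictured with a small Python sketch: events are grouped into fixed (tumbling) time windows, and one metric point is produced per window, in a shape a time series database could store. This is only a single-process illustration, not the Spark job Griffin uses; the function name, the `ts` field and the 60-second window are assumptions chosen for the example.

```python
from collections import defaultdict


def windowed_completeness(events, required_fields, window_seconds=60):
    """Return one (window_start, completeness) point per tumbling window.

    Each event is a dict with an integer "ts" (epoch seconds) plus payload fields.
    """
    totals = defaultdict(int)
    complete = defaultdict(int)
    for event in events:
        # Align the event timestamp to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        totals[window_start] += 1
        if all(event.get(field) not in (None, "") for field in required_fields):
            complete[window_start] += 1
    return [(start, complete[start] / totals[start]) for start in sorted(totals)]


# Hypothetical events as they might arrive from a messaging system.
events = [
    {"ts": 0, "user_id": "u1", "item_id": "i1"},
    {"ts": 30, "user_id": "u2", "item_id": None},  # incomplete record
    {"ts": 61, "user_id": "u3", "item_id": "i3"},
    {"ts": 119, "user_id": "u4", "item_id": "i4"},
]

points = windowed_completeness(events, ["user_id", "item_id"])
print(points)
```

Keeping each point as a timestamp and a value lets the front end request a metric's history by time range.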
+
+ '''Griffin Service'''
+
+ RESTful web services expose all of Griffin's functionality, such as registering data assets, creating data quality models, publishing metrics, retrieving metrics and adding subscriptions, so developers can build their own user interfaces on top of these web services.
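
To make the operations listed above concrete, here is a minimal in-memory Python stand-in for such a service. It does not reflect Griffin's actual REST endpoints or payloads, which this proposal does not specify; the class, method and field names are hypothetical and only mirror the operations named in the text.

```python
class InMemoryQualityService:
    """Toy stand-in for the listed service operations; not Griffin's real API."""

    def __init__(self):
        self.assets = {}         # asset name -> asset metadata
        self.models = {}         # model name -> model definition
        self.metrics = {}        # model name -> list of (timestamp, value)
        self.subscriptions = {}  # user -> set of asset names

    def register_asset(self, name, platform):
        self.assets[name] = {"platform": platform}

    def create_model(self, model_name, asset_name, dimension):
        # A model can only be defined on an asset that was registered first.
        if asset_name not in self.assets:
            raise KeyError(f"unknown data asset: {asset_name}")
        self.models[model_name] = {"asset": asset_name, "dimension": dimension}
        self.metrics[model_name] = []

    def publish_metric(self, model_name, timestamp, value):
        self.metrics[model_name].append((timestamp, value))

    def retrieve_metrics(self, model_name):
        # Return the metric history ordered by timestamp.
        return sorted(self.metrics[model_name])

    def add_subscription(self, user, asset_name):
        self.subscriptions.setdefault(user, set()).add(asset_name)


service = InMemoryQualityService()
service.register_asset("user_events", platform="kafka")
service.create_model("user_events_accuracy", "user_events", dimension="accuracy")
service.publish_metric("user_events_accuracy", 1479503569, 0.98)
service.publish_metric("user_events_accuracy", 1479503509, 0.97)
service.add_subscription("alice", "user_events")

history = service.retrieve_metrics("user_events_accuracy")
print(history)
```

A user interface built on a real service would follow the same order: register the asset, define a model on it, then read back the published metrics.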
+
+ == Background ==
+ At eBay, when people work with big data in Apache Hadoop (or with streaming data), data quality often becomes a major challenge. Different teams have built customized data quality tools to detect and analyze data quality issues within their own domains. We want to take a platform approach that provides shared infrastructure and generic features to solve common data quality pain points. This would enable us to build trusted data assets.
+
+ Currently it’s very difficult and costly to validate data quality when big data flows across multiple platforms at eBay (e.g. Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB). Take the eBay real-time personalization platform as an example: every day we have to validate the data quality status of ~600M records (consider that our website has 150M active users). Data quality is often a major challenge in both its streaming and batch pipelines.
+
+ We see three data quality problems at eBay:
+
+  1. There is no unified end-to-end view of data quality measurement from multiple data sources to target applications, so it usually takes a long time to identify and fix poor data quality.
+  2. There is no established way to measure data quality in streaming mode. We need a process and a tool to visualize data quality insights: registering the dataset to be checked, creating a data quality measurement model, executing the data quality validation job, and getting metric insights to act on.
+  3. There is no shared platform or API service, so each team has to provision and manage its own hardware and software infrastructure.
+
+ == Rationale ==
+ The challenge we face at eBay is that our data volume keeps growing and our system processes are becoming more complex, while we do not have a unified data quality solution that ensures trusted data sets and gives our data consumers confidence in data quality. The key data quality challenges include:
+
+  1. Existing commercial data quality solutions cannot address data quality lineage among systems and cannot scale out to support fast-growing data at eBay.
+  2. Existing eBay domain-specific tools take a long time to identify and fix poor data quality when data flows through multiple systems.
+  3. Business logic is becoming complex, which requires a much more flexible data quality system.
+  4. Some data quality issues have business impact on user experience, revenue, efficiency and compliance.
+  5. Communicating data quality metrics carries overhead, typically in a big organization where different teams are involved.
+
+ The idea of Griffin is to provide data quality validation as a service, giving data engineers and data consumers:
+
+  * Near real-time understanding of the data quality health of their data pipelines with end-to-end monitoring, all in one place.
+  * Profiling, detection and correlation of issues, with recommendations that drive rapid and focused troubleshooting.
+  * A centralized data quality model management system covering rules, metadata, scheduler, etc.
+  * Native code generation to run everywhere, including Hadoop, Kafka, Spark, etc.
+  * One set of tools to build data quality pipelines across all eBay data platforms.
+
+ == Current Status ==
+ === Meritocracy ===
+ Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems (for example, the real-time personalization platform, the eBay real-time ID linking platform, Hadoop datasets, and the site speed analytics platform). Our aim is to build a diverse developer and user community following the Apache meritocracy model. We will encourage contributions and participation of all types, and ensure that contributors are appropriately recognized.
+
+ === Community ===
+ Currently the project is developed at eBay and serves only the eBay internal community. Griffin seeks to grow its developer and user communities during incubation. We believe the community will grow substantially once Griffin becomes an Apache project.
+
+ === Core Developers ===
+ Griffin is currently being designed and developed by engineers from eBay Inc.: William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu. All of these core developers have deep expertise in Apache Hadoop and the Hadoop ecosystem in general.
+
+ === Alignment ===
+ The ASF is a natural host for Griffin given that it is already the home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other emerging big data products. By nature, these projects need a data quality solution to ensure the quality of the data they process. When people use open source data technology, a big question for them is how to ensure the quality of the data in it. Griffin leverages many Apache open-source products. Griffin was designed to enable real-time insights into data quality validation through shared infrastructure and generic features that solve common data quality pain points.
+
+ == Known Risks ==
+ === Orphaned Products ===
+ The core developers of the Griffin team work full time on this project. There is no risk of Griffin being orphaned, since at least one large company (eBay) uses it extensively in its production Hadoop and Spark clusters for multiple data systems. For example, four data systems at eBay (the real-time personalization platform, the eBay real-time ID linking platform, Hadoop, and the site speed analytics platform) currently leverage Griffin, with ~600M records validated for data quality status every day, 35 data sets monitored, and 50+ data quality models created.
+
+ As Griffin is designed to connect to many types of data sources, we are very confident that users of those data sources will adopt Griffin as a service for ensuring data quality in open source data ecosystems. We plan to extend and diversify this community further through Apache.
+
+ === Inexperience with Open Source ===
+ Griffin's core engineers are all active users and followers of open source projects. They are already committers and contributors to the Griffin GitHub project. All have worked on source code that has been released under an open source license, and several of them also have experience developing code in an open source environment. Though the core developers do not have Apache open source experience, there are plans to onboard individuals with Apache open source experience onto the project.
+
+
+ === Homogenous Developers ===
+ The core developers are from eBay. The Apache Incubation process encourages an open and diverse meritocratic community. Griffin intends to make every possible effort to build a diverse, vibrant and involved community. We are committed to recruiting additional committers from other companies based on their contributions to the project.
+
+ === Reliance on Salaried Developers ===
+ eBay has invested in Griffin as a company-wide data quality service platform, and some of its key engineers work full time on the project. They are all paid by eBay. We look forward to other Apache developers and researchers contributing to the project.
+
+ === Relationships with Other Apache Products ===
+ Griffin has a strong relationship with, and dependency on, Apache Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and Apache Hive. In addition, since there is a growing need for data quality solutions for open source platforms (e.g. Hadoop, Kafka, Spark), being part of Apache’s Incubation community could help foster closer collaboration among these projects as well as others.
+
+ == Documentation ==
+ Information about Griffin can be found at https://github.com/eBay/griffin
+
+ == Initial Source ==
+ Griffin has been under development since early 2016 by a team of engineers at eBay Inc. It is currently hosted on GitHub under the Apache License 2.0 at https://github.com/eBay/griffin . Once in incubation, we will move the code base to an Apache Git repository.
+
+ == External Dependencies ==
+ Griffin has the following external dependencies.
+
+ ''' Basic '''
+  * JDK 1.7+
+  * Scala
+  * Apache Maven
+  * JUnit
+  * Log4j
+  * Slf4j
+  * Apache Commons
+
+ '''Hadoop'''
+  * Apache Hadoop
+  * Apache HBase
+  * Apache Hive
+
+ '''DB'''
+  * InfluxData
+
+ '''Apache Spark'''
+  * Spark Core Library
+
+ '''REST Service'''
+  * Jersey
+  * Spring MVC
+
+ '''Web frontend'''
+  * AngularJS
+  * jQuery
+  * Bootstrap
+  * RequireJS
+  * eCharts
+  * Font Awesome
+
+ == Cryptography ==
+ Currently there is no cryptography in Griffin.
+
+ == Required Resources ==
+ === Mailing List ===
+ We currently communicate through an eBay mailbox, but we would like to move to ASF-maintained mailing lists.
+
+ Current mailing list: ebay-griffin-devs@googlegroups.com
+
+ Proposed ASF maintained lists: private@griffin.incubator.apache.org
+
+ dev@griffin.incubator.apache.org
+
+ commits@griffin.incubator.apache.org
+
+ === Subversion Directory ===
+ Git is the preferred source control system.
+
+ === Issue Tracking ===
+ JIRA
+
+ === Other Resources ===
+ The existing code already has unit tests, so we will make use of the existing Apache continuous testing infrastructure. The resulting load should not be very large.
+
+ == Initial Committers ==
+ William Guo
+ Alex Lv
+ Vincent Zhao
+ Shawn Sha
+ John Liu
+ Liang Shao
+
+ == Affiliations ==
+ The initial committers are employees of eBay Inc.
+
+ == Sponsors ==
+ === Champion ===
+ Henry Saputra (hsaputra@apache.org) - Apache IPMC member
+
+ === Nominated Mentors ===
+ Kasper Sørensen (kaspersor@apache.org), Uma Maheswara Rao Gangumalla (umamahesh@apache.org), Luciano Resende (luckbr1975@gmail.com)
+
+ === Sponsoring Entity ===
+ We are requesting the Incubator to sponsor this project.
+
