From: Apache Wiki
To: Apache Wiki
Date: Fri, 18 Nov 2016 21:12:49 -0000
Message-ID: <147950356990.30572.12807797323529397455@moin-vm.apache.org>
Subject: [Incubator Wiki] Update of "GriffinProposal" by alexlv

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "GriffinProposal" page has been changed by alexlv:
https://wiki.apache.org/incubator/GriffinProposal?action=diff&rev1=1&rev2=2

  * Data quality health monitoring visualization
  * Shared infrastructure resource management

+ == Overview of Griffin ==
+ Griffin has been deployed in production at eBay, where it serves major data systems. It takes a platform approach, providing generic features that solve common data quality validation pain points. First, users register the data asset they want to check for data quality. The data asset can be batch data in an RDBMS (e.g. Teradata) or an Apache Hadoop system, or near real-time streaming data from Apache Kafka, Apache Storm and other real-time data platforms. Second, users create a data quality model that defines the data quality rules and metadata. Third, the model engine executes the model or rule automatically; for streaming data, sample data quality validation results are available within a few seconds.
Finally, users analyze the data quality results through a built-in visualization tool and act on them.
+
+ Griffin includes:
+
+ '''Data Quality Model Engine'''
+
+ Griffin is a model-driven solution: users choose among data quality dimensions and run their data quality validation against a selected target data set or a source data set (used as the golden reference data). A back-end library supports the following measurements:
+ * Accuracy - Does the data reflect real-world objects or a verifiable source?
+ * Completeness - Is all necessary data present?
+ * Validity - Are all data values within the data domains specified by the business?
+ * Timeliness - Is the data available at the time it is needed?
+ * Anomaly detection - Pre-built algorithm functions that identify items, events or observations which do not conform to an expected pattern or to other items in a dataset.
+ * Data Profiling - Statistical analysis and assessment of data values within a dataset for consistency, uniqueness and logic.
+
+ '''Data Collection Layer'''
+
+ We support two kinds of data sources: batch data and real-time data.
+
+ In batch mode, we collect data from Apache Hadoop based platforms through various data connectors.
+
+ In real-time mode, we connect to a messaging system such as Kafka for near real-time analysis.
+
+ '''Data Process and Storage Layer'''
+
+ For batch analysis, our data quality model computes data quality metrics in our Spark cluster from data sources in Apache Hadoop.
+
+ For near real-time analysis, we consume data from the messaging system, and our data quality model then computes real-time data quality metrics in our Spark cluster. For data storage, we use a time series database in our back end to serve front-end requests.
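To make the Accuracy and Completeness measurements listed above concrete, here is a minimal sketch in plain Python. It is not Griffin's engine (which, per the text above, runs on Spark), and the function names `accuracy` and `completeness`, the record layout, and the `id` and `email` fields are all hypothetical, chosen only for this illustration.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical record shape: a flat mapping of field name to value.
Record = Dict[str, Optional[str]]


def accuracy(source: List[Record], target: List[Record], key: str) -> float:
    """Fraction of source (golden reference) records that have an
    identical record in the target, matched on the given key field."""
    if not source:
        return 1.0  # nothing to verify, so nothing is inaccurate
    target_by_key = {record[key]: record for record in target}
    matched = sum(1 for record in source if target_by_key.get(record[key]) == record)
    return matched / len(source)


def completeness(records: List[Record], required: Iterable[str]) -> float:
    """Fraction of records in which every required field is present
    and is neither None nor an empty string."""
    if not records:
        return 1.0
    required_fields = list(required)
    complete = sum(
        1
        for record in records
        if all(record.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(records)


# Illustrative data only: four reference records and a target copy
# in which one value was corrupted and one value is missing.
source = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": "b@example.com"},
    {"id": "3", "email": "c@example.com"},
    {"id": "4", "email": "d@example.com"},
]
target = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": "WRONG"},
    {"id": "3", "email": "c@example.com"},
    {"id": "4", "email": None},
]

print(accuracy(source, target, "id"))           # 0.5
print(completeness(target, ["id", "email"]))    # 0.75
```

In this sketch each metric is a ratio between 0 and 1, which is one simple form a time series store could track per data set over time; the real model engine may define and aggregate these measurements differently.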
+
+ '''Griffin Service'''
+
+ RESTful web services expose all of Griffin's functionality, such as registering data assets, creating data quality models, publishing metrics, retrieving metrics and adding subscriptions, so developers can build their own user interfaces on top of these web services.
+
+ == Background ==
+ At eBay, when people work with big data in Apache Hadoop (or with other streaming data), data quality often becomes a major challenge. Different teams have built customized data quality tools to detect and analyze data quality issues within their own domains. We intend to take a platform approach, providing shared infrastructure and generic features that solve common data quality pain points. This would enable us to build trusted data assets.
+
+ Currently it is very difficult and costly to validate data quality when big data flows across multiple platforms at eBay (e.g. Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB). Take the eBay real-time personalization platform as an example: every day we have to validate the data quality status of ~600M records (imagine we have 150M active users on our website). Data quality is a major challenge in both its streaming and batch pipelines.
+
+ We have identified three data quality problems at eBay:
+
+ 1. There is no unified end-to-end view of data quality measurement from multiple data sources to target applications, so it usually takes a long time to identify and fix poor data quality.
+ 2. There is no established way to measure data quality in streaming mode. We need a process and a tool that surface data quality insights by registering the dataset to be checked, creating a data quality measurement model, executing the data quality validation job and returning metrics that drive action.
+ 3. There is no shared platform or API service, so each team has to request and manage its own hardware and software infrastructure.
+
+ == Rationale ==
+ The challenge we face at eBay is that our data volume keeps growing and our system processes keep getting more complex, while we have no unified data quality solution to ensure trusted data sets that give our data consumers confidence in data quality. The key data quality challenges include:
+
+ 1. Existing commercial data quality solutions cannot address data quality lineage among systems and cannot scale out to support fast-growing data at eBay.
+ 2. eBay's existing domain-specific tools take a long time to identify and fix poor data quality when data flows through multiple systems.
+ 3. Business logic is becoming more complex, which requires a much more flexible data quality system.
+ 4. Some data quality issues have a business impact on user experience, revenue, efficiency and compliance.
+ 5. Communicating data quality metrics carries overhead, typically in a big organization where different teams are involved.
+
+ The idea of Griffin is to provide data quality validation as a service, giving data engineers and data consumers:
+
+ * Near real-time understanding of the data quality health of their data pipelines with end-to-end monitoring, all in one place.
+ * Profiling, detection and correlation of issues, with recommendations that drive rapid and focused troubleshooting.
+ * A centralized data quality model management system covering rules, metadata, scheduler, etc.
+ * Native code generation to run everywhere, including Hadoop, Kafka, Spark, etc.
+ * One set of tools to build data quality pipelines across all eBay data platforms.
+
+ == Current Status ==
+ === Meritocracy ===
+ Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems (for example, the real-time personalization platform, the eBay real-time ID linking platform, Hadoop datasets and the site speed analytics platform).
Our aim is to build a diverse developer and user community following the Apache meritocracy model. We will encourage contributions and participation of all kinds, and ensure that contributors are appropriately recognized.
+
+ === Community ===
+ Currently the project is developed at eBay and is used only by the eBay internal community. Griffin seeks to develop its developer and user communities during incubation. We believe it will grow substantially by becoming an Apache project.
+
+ === Core Developers ===
+ Griffin is currently being designed and developed by engineers from eBay Inc.: William Guo, Alex Lv, Shawn Sha, Vincent Zhao and John Liu. All of these core developers have deep expertise in Apache Hadoop and the Hadoop ecosystem in general.
+
+ === Alignment ===
+ The ASF is a natural host for Griffin given that it is already the home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other emerging big data products. These products inherently need a data quality solution to ensure the quality of the data they process. When people use open source data technology, a key question for them is how to ensure the quality of the data in it. Griffin leverages many Apache open-source products. Griffin was designed to enable real-time insights into data quality validation through shared infrastructure and generic features that solve common data quality pain points.
+
+ == Known Risks ==
+ === Orphaned Products ===
+ The core developers of the Griffin team work full time on this project. There is no risk of Griffin being orphaned, since at least one large company (eBay) uses it extensively in its production Hadoop and Spark clusters for multiple data systems.
For example, four data systems at eBay (the real-time personalization platform, the eBay real-time ID linking platform, Hadoop and the site speed analytics platform) currently use Griffin, with more than 600M records validated for data quality status every day, 35 data sets monitored and 50+ data quality models created.
+
+ As Griffin is designed to connect to many types of data sources, we are very confident that users of open source data ecosystems will adopt Griffin as a service for ensuring data quality. We plan to extend and diversify this community further through Apache.
+
+ === Inexperience with Open Source ===
+ Griffin's core engineers are all active users and followers of open source projects. They are already committers and contributors to the Griffin GitHub project. All have worked on the source code that has been released under an open source license, and several of them also have experience developing code in an open source environment. Though the core developers do not have Apache open source experience, there are plans to onboard individuals with Apache open source experience onto the project.
+
+
+ === Homogenous Developers ===
+ The core developers are from eBay. The Apache Incubation process encourages an open and diverse meritocratic community. Griffin intends to make every possible effort to build a diverse, vibrant and involved community. We are committed to recruiting additional committers from other companies based on their contributions to the project.
+
+ === Reliance on Salaried Developers ===
+ eBay has invested in Griffin as a company-wide data quality service platform, and some of its key engineers work full time on the project. They are all paid by eBay. We look forward to other Apache developers and researchers contributing to the project.
+
+ === Relationships with Other Apache Products ===
+ Griffin has a strong relationship with, and dependency on, Apache Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and Apache Hive. In addition, since there is a growing need for a data quality solution for open source platforms (e.g. Hadoop, Kafka, Spark), being part of Apache's Incubation community could enable closer collaboration among these projects as well as others.
+
+ == Documentation ==
+ Information about Griffin can be found at https://github.com/eBay/griffin
+
+ == Initial Source ==
+ Griffin has been under development since early 2016 by a team of engineers at eBay Inc. It is currently hosted on GitHub under the Apache License 2.0 at https://github.com/eBay/griffin . Once in incubation, we will move the code base to an Apache Git repository.
+
+ == External Dependencies ==
+ Griffin has the following external dependencies.
+
+ '''Basic'''
+ * JDK 1.7+
+ * Scala
+ * Apache Maven
+ * JUnit
+ * Log4j
+ * Slf4j
+ * Apache Commons
+
+ '''Hadoop'''
+ * Apache Hadoop
+ * Apache HBase
+ * Apache Hive
+
+ '''DB'''
+ * InfluxData
+
+ '''Apache Spark'''
+ * Spark Core Library
+
+ '''REST Service'''
+ * Jersey
+ * Spring MVC
+
+ '''Web frontend'''
+ * AngularJS
+ * jQuery
+ * Bootstrap
+ * RequireJS
+ * eCharts
+ * Font Awesome
+
+ == Cryptography ==
+ Currently there is no cryptography in Griffin.
+
+ == Required Resources ==
+ === Mailing List ===
+ We currently use eBay mailboxes to communicate, but we would like to move to ASF-maintained mailing lists.
+
+ Current mailing list: ebay-griffin-devs@googlegroups.com
+
+ Proposed ASF maintained lists: private@griffin.incubator.apache.org
+
+ dev@griffin.incubator.apache.org
+
+ commits@griffin.incubator.apache.org
+
+ === Subversion Directory ===
+ Git is the preferred source control system.
+
+ === Issue Tracking ===
+ JIRA
+
+ === Other Resources ===
+ The existing code already has unit tests, so we will make use of the existing Apache continuous testing infrastructure. The resulting load should not be very large.
+
+ == Initial Committers ==
+ William Guo
+ Alex Lv
+ Vincent Zhao
+ Shawn Sha
+ John Liu
+ Liang Shao
+
+ == Affiliations ==
+ The initial committers are employees of eBay Inc.
+
+ == Sponsors ==
+ === Champion ===
+ Henry Saputra (hsaputra@apache.org) - Apache IPMC member
+
+ === Nominated Mentors ===
+ Kasper Sørensen (kaspersor@apache.org), Uma Maheswara Rao Gangumalla (umamahesh@apache.org), Luciano Resende (luckbr1975@gmail.com)
+
+ === Sponsoring Entity ===
+ We are requesting the Incubator to sponsor this project.
+

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org