From: Henry Saputra
Date: Mon, 5 Dec 2016 23:48:02 -0800
Subject: [RESULT] [VOTE] Bring Griffin to Apache Incubator
To: "general@incubator.apache.org"

Thank you to all who participated in the VOTE.

The VOTE has ended with this result:

8 +1 binding votes (Stian, Henry, Andrew, Jacques, Uma, Luciano, Jullian, Kasper)
5 +1 non-binding votes
No 0 votes
No -1 votes

With this, Griffin is officially accepted as an Apache Incubator project.

Congrats! I will move on with the bootstrapping process.

- Henry

On Wed, Nov 30, 2016 at 10:40 PM, Henry Saputra wrote:

> Hi All,
>
> As the champion for Griffin, I would like to start a VOTE to bring the
> project in as an Apache Incubator podling.
>
> Here is the direct quote from the abstract:
>
> "
> Griffin is a Data Quality Service platform built on Apache Hadoop and
> Apache Spark. It provides a framework process for defining data
> quality model, executing data quality measurement, automating data
> profiling and validation, as well as a unified data quality
> visualization across multiple data systems. It tries to address the
> data quality challenges in big data and streaming context.
> "
>
> Please cast your vote:
>
> [ ] +1, bring Griffin into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Griffin into Incubator, because...
>
> This vote will be open for at least 72 hours and only votes from the
> Incubator PMC are binding.
>
> The VOTE will end 12/5 at 9am PST to pass through the weekend.
>
> Here is the link to the proposal:
>
> https://wiki.apache.org/incubator/GriffinProposal
>
> I have copied the proposal below for easy access.
>
> Thanks,
>
> - Henry
>
> Griffin Proposal
>
> Abstract
>
> Griffin is a Data Quality Service platform built on Apache Hadoop and
> Apache Spark. It provides a framework process for defining data
> quality model, executing data quality measurement, automating data
> profiling and validation, as well as a unified data quality
> visualization across multiple data systems.
> It tries to address the data quality challenges in big data and
> streaming context.
>
> Proposal
>
> Griffin is an open source Data Quality solution for distributed data
> systems at any scale, in both streaming and batch data contexts. When
> people use open source products (e.g. Apache Hadoop, Apache Spark,
> Apache Kafka, Apache Storm), they always need a data quality service
> to build their confidence in the quality of the data processed by
> those platforms. Griffin creates a unified process to define and
> construct data quality measurement pipelines across multiple data
> systems to provide:
>
> Automatic quality validation of the data
> Data profiling and anomaly detection
> Data quality lineage from upstream to downstream data systems
> Data quality health monitoring visualization
> Shared infrastructure resource management
>
> Overview of Griffin
>
> Griffin has been deployed in production at eBay, serving major data
> systems. It takes a platform approach to provide generic features that
> solve common data quality validation pain points. First, users
> register the data assets they want to check for data quality. A data
> asset can be batch data in an RDBMS (e.g. Teradata) or an Apache
> Hadoop system, or near real-time streaming data from Apache Kafka,
> Apache Storm and other real-time data platforms. Second, users create
> a data quality model to define the data quality rules and metadata.
> Third, the model or rule is executed automatically (by the model
> engine) to get the sample data quality validation results within a
> few seconds for streaming data. Finally, users can analyze the data
> quality results through the built-in visualization tool and take
> action.
>
> Griffin includes:
>
> Data Quality Model Engine
>
> Griffin is a model-driven solution: users can choose various data
> quality dimensions to execute their data quality validation based on
> a selected target data set or source data set (as the golden
> reference data).
> It has a corresponding library supporting it in the back end for the
> following measurements:
>
> Accuracy - Does data reflect the real-world objects or a verifiable source
> Completeness - Is all necessary data present
> Validity - Are all data values within the data domains specified by the business
> Timeliness - Is the data available at the time needed
> Anomaly detection - Pre-built algorithm functions for the
> identification of items, events or observations which do not conform
> to an expected pattern or other items in a dataset
> Data Profiling - Apply statistical analysis and assessment of data
> values within a dataset for consistency, uniqueness and logic.
>
> Data Collection Layer
>
> We support two kinds of data sources: batch data and real time data.
>
> For batch mode, we can collect source data from Apache Hadoop based
> platforms through various data connectors.
>
> For real time mode, we can connect to messaging systems like Kafka for
> near real time analysis.
>
> Data Process and Storage Layer
>
> For batch analysis, our data quality model will compute data quality
> metrics in our Spark cluster based on data sources in Apache Hadoop.
>
> For near real time analysis, we consume data from the messaging
> system, then our data quality model will compute our real time data
> quality metrics in our Spark cluster. For data storage, we use a time
> series database in our back end to fulfill front end requests.
>
> Griffin Service
>
> RESTful web services expose all of Griffin's functionality, such as
> registering data assets, creating data quality models, publishing
> metrics, retrieving metrics, adding subscriptions, etc. Developers can
> therefore build their own user interfaces on top of these web
> services.
>
> Background
>
> At eBay, when people work with big data in Apache Hadoop (or other
> streaming data), data quality often becomes a big challenge.
> Different teams have built customized data quality tools to detect and
> analyze data quality issues within their own domain. We want to take a
> platform approach that provides shared infrastructure and generic
> features to solve common data quality pain points. This would enable
> us to build trusted data assets.
>
> Currently it's very difficult and costly to do data quality validation
> when we have big data flowing across multiple platforms at eBay (e.g.
> Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka,
> MongoDB). Take eBay's real time personalization platform as an
> example: every day we have to validate data quality status for ~600M
> records (imagine we have 150M active users for our website). Data
> quality often becomes a big challenge in both its streaming and batch
> pipelines.
>
> We have identified 3 data quality problems at eBay:
>
> Lack of an end-to-end unified view of data quality measurement from
> multiple data sources to target applications; it usually takes a long
> time to identify and fix poor data quality.
> How to measure data quality in streaming mode: we need a process and
> tool to visualize data quality insights by registering the dataset to
> be checked, creating a data quality measurement model, executing the
> data quality validation job and getting metrics insights for taking
> action.
> No shared platform and API service; each team has to apply for and
> manage its own hardware and software infrastructure.
>
> Rationale
>
> The challenge we face at eBay is that our data volume keeps growing
> and our system processes are becoming more complex, while we do not
> have a unified data quality solution to ensure trusted data sets that
> give our data consumers confidence in data quality.
> The key challenges on data quality include:
>
> Existing commercial data quality solutions cannot address data quality
> lineage among systems and cannot scale out to support the fast growing
> data at eBay
> Existing eBay domain-specific tools take a long time to identify and
> fix poor data quality when data flows through multiple systems
> Business logic is becoming complex, which requires the data quality
> system to be much more flexible.
>
> Some data quality issues do have business impact on user experiences,
> revenue, efficiency & compliance.
>
> Communication overhead of data quality metrics, typically in a big
> organization, which involves different teams.
>
> The idea of Griffin is to provide Data Quality validation as a
> Service, to allow data engineers and data consumers to have:
>
> Near real-time understanding of the data quality health of your data
> pipelines with end-to-end monitoring, all in one place.
> Profiling, detecting and correlating issues and providing
> recommendations that drive rapid and focused troubleshooting
> A centralized data quality model management system including rule,
> metadata, scheduler etc.
> Native code generation to run everywhere, including Hadoop, Kafka, Spark, etc.
> One set of tools to build data quality pipelines across all eBay data platforms.
>
> Current Status
>
> Meritocracy
>
> Griffin has been deployed in production at eBay and provides the
> centralized data quality service for several eBay systems (for
> example, real time personalization platform, eBay real time ID linking
> platform, Hadoop datasets, Site speed analytics platform). Our aim is
> to build a diverse developer and user community following the Apache
> meritocracy model. We will encourage contributions and participation
> of all types of work, and ensure that contributors are appropriately
> recognized.
>
> Community
>
> Currently the project is being developed at eBay, and only for the
> eBay internal community.
> Griffin seeks to develop the developer and user communities during
> incubation. We believe it will grow substantially by becoming an
> Apache project.
>
> Core Developers
>
> Griffin is currently being designed and developed by engineers from
> eBay Inc.: William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
> All of these core developers have deep expertise in Apache Hadoop and
> the Hadoop Ecosystem in general.
>
> Alignment
>
> The ASF is a natural host for Griffin given that it is already the
> home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
> emerging big data products. Those products by nature require a data
> quality solution to ensure the quality of the data they process. When
> people use open source data technology, the big question for them is
> how to ensure the quality of the data in it. Griffin leverages a lot
> of Apache open-source products. Griffin was designed to enable real
> time insights into data quality validation through shared
> infrastructure and generic features that solve common data quality
> pain points.
>
> Known Risks
>
> Orphaned Products
>
> The core developers of the Griffin team work full time on this
> project. There is no risk of Griffin getting orphaned since at least
> one large company (eBay) is extensively using it in its production
> Hadoop and Spark clusters for multiple data systems. For example, 4
> data systems at eBay (real time personalization platform, eBay real
> time ID linking platform, Hadoop, Site speed analytics platform) are
> currently leveraging Griffin, with ~600M records validated for data
> quality status every day, 35 data sets being monitored, and 50+ data
> quality models created.
>
> As Griffin is designed to connect many types of data sources, we are
> very confident that others will use Griffin as a service for ensuring
> data quality in open source data ecosystems. We plan to extend and
> diversify this community further through Apache.
>
> Inexperience with Open Source
>
> Griffin's core engineers are all active users and followers of open
> source projects. They are already committers and contributors to the
> Griffin GitHub project. All have been involved with source code that
> has been released under an open source license, and several of them
> also have experience developing code in an open source environment.
> Though the core set of developers do not have Apache open source
> experience, there are plans to onboard individuals with Apache open
> source experience onto the project.
>
> Homogenous Developers
>
> The core developers are from eBay. The Apache Incubation process
> encourages an open and diverse meritocratic community. Griffin intends
> to make every possible effort to build a diverse, vibrant and involved
> community. We are committed to recruiting additional committers from
> other companies based on their contribution to the project.
>
> Reliance on Salaried Developers
>
> eBay invested in Griffin as a company-wide data quality service
> platform and some of its key engineers are working full time on the
> project. They are all paid by eBay. We look forward to other Apache
> developers and researchers contributing to the project.
>
> Relationships with Other Apache Products
>
> Griffin has a strong relationship with and dependency on Apache
> Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
> Apache Hive. In addition, since there is a growing need for data
> quality solutions for open source platforms (e.g. Hadoop, Kafka, Spark
> etc.), being part of Apache's Incubation community could help with
> closer collaboration among these projects as well as others.
>
> Documentation
>
> Information about Griffin can be found at https://github.com/eBay/griffin
>
> Initial Source
>
> Griffin has been under development since early 2016 by a team of
> engineers at eBay Inc.
> It is currently hosted on GitHub under the Apache License 2.0 at
> https://github.com/eBay/griffin . Once in incubation we will be moving
> the code base to an Apache git repository.
>
> External Dependencies
>
> Griffin has the following external dependencies.
>
> Basic
>
> JDK 1.7+
> Scala
> Apache Maven
> JUnit
> Log4j
> Slf4j
> Apache Commons
>
> Hadoop
>
> Apache Hadoop
> Apache HBase
> Apache Hive
>
> DB
>
> InfluxData
>
> Apache Spark
>
> Spark Core Library
>
> REST Service
>
> Jersey
> Spring MVC
>
> Web frontend
>
> AngularJS
> jQuery
> Bootstrap
> RequireJS
> eCharts
> Font Awesome
>
> Cryptography
>
> Currently there's no cryptography in Griffin.
>
> Required Resources
>
> Mailing List
>
> We currently use an eBay mailbox to communicate, but we'd like to move
> that to ASF maintained mailing lists.
>
> Current mailing list: ebay-griffin-devs@googlegroups.com
>
> Proposed ASF maintained lists:
>
> private@griffin.incubator.apache.org
>
> dev@griffin.incubator.apache.org
>
> commits@griffin.incubator.apache.org
>
> Subversion Directory
>
> Git is the preferred source control system.
>
> Issue Tracking
>
> JIRA
>
> Other Resources
>
> The existing code already has unit tests so we will make use of
> existing Apache continuous testing infrastructure. The resulting load
> should not be very large.
>
> Initial Committers
>
> William Guo
> Alex Lv
> Vincent Zhao
> Shawn Sha
> John Liu
> Liang Shao
>
> Affiliations
>
> The initial committers are employees of eBay Inc.
>
> Sponsors
>
> Champion
>
> Henry Saputra (hsaputra@apache.org)
>
> Nominated Mentors
>
> Kasper Sørensen (kaspersor@apache.org)
>
> Uma Maheswara Rao Gangumalla (umamahesh@apache.org)
>
> Luciano Resende (luckbr1975@gmail.com)
>
> Sponsoring Entity
>
> We are requesting the Incubator to sponsor this project.
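
For readers unfamiliar with the measurements listed under "Data Quality Model Engine" in the quoted proposal, here is a minimal sketch of how Accuracy and Completeness might be computed over small in-memory record sets. It is not Griffin's implementation: the function names, the record layout, the `id` key and the choice to score accuracy as the share of golden records reproduced exactly in the target are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence

Record = Dict[str, object]


def accuracy(golden: Iterable[Record], target: Iterable[Record], key: str) -> float:
    """Share of golden (source) records that appear unchanged in the target.

    Records are paired by the value of `key`; a golden record counts as
    accurate only if the target holds an identical record under that key.
    """
    target_by_key = {row[key]: row for row in target}
    golden_rows = list(golden)
    if not golden_rows:
        return 1.0  # nothing to verify, so nothing is inaccurate
    matched = sum(1 for row in golden_rows if target_by_key.get(row[key]) == row)
    return matched / len(golden_rows)


def completeness(rows: Iterable[Record], required: Sequence[str]) -> float:
    """Share of records in which every required field is present and not None."""
    all_rows = list(rows)
    if not all_rows:
        return 1.0
    complete = sum(
        1 for row in all_rows if all(row.get(field) is not None for field in required)
    )
    return complete / len(all_rows)


# Hypothetical sample data: four golden records, three target records.
golden = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": "b@example.org"},
    {"id": 3, "email": "c@example.org"},
    {"id": 4, "email": "d@example.org"},
]
target = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": None},  # value lost on the way downstream
    {"id": 3, "email": "c@example.org"},
]

print("accuracy:", accuracy(golden, target, key="id"))
print("completeness:", completeness(target, required=["id", "email"]))
```

With this sample data, two of the four golden records survive unchanged and two of the three target records have every required field. At the scale the proposal describes, the same ratios would presumably be computed as distributed jobs rather than in memory.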