From: hongbin ma
Date: Mon, 26 Oct 2015 10:50:40 +0800
Subject: Re: [VOTE] Accept Eagle into Apache Incubation
To: general@incubator.apache.org

+1 (non binding)

On Mon, Oct 26, 2015 at 12:20 AM, Ralph Goers wrote:

> +1 (binding)
>
> Ralph
>
> > On Oct 23, 2015, at 7:11 AM,
> > Manoharan, Arun wrote:
> >
> > Hello Everyone,
> >
> > Thanks for all the feedback on the Eagle Proposal.
> >
> > I would like to call for a [VOTE] on Eagle joining the ASF as an incubation project.
> >
> > The vote is open for 72 hours:
> >
> > [ ] +1 accept Eagle in the Incubator
> > [ ] ±0
> > [ ] -1 (please give reason)
> >
> > Eagle is a Monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks, malicious activities and take actions in real time. Eagle supports a wide variety of policies on HDFS data and Hive. Eagle also provides machine learning models for detecting anomalous user behavior in Hadoop.
> >
> > The proposal is available on the wiki here:
> > https://wiki.apache.org/incubator/EagleProposal
> >
> > The text of the proposal is also available at the end of this email.
> >
> > Thanks for your time and help.
> >
> > Thanks,
> > Arun
> >
> >
> > Eagle
> >
> > Abstract
> > Eagle is an Open Source Monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks, malicious activities in hadoop and take actions.
> >
> > Proposal
> > Eagle audits access to HDFS files, Hive and HBase tables in real time, enforces policies defined on sensitive data access and alerts or blocks user’s access to that sensitive data in real time. Eagle also creates user profiles based on the typical access behaviour for HDFS and Hive and sends alerts when anomalous behaviour is detected. Eagle can also import sensitive data information classified by external classification engines to help define its policies.
> >
> > Overview of Eagle
> > Eagle has 3 main parts.
> > 1. Data collection and storage - Eagle collects data from various hadoop logs in real time using Kafka/Yarn API and uses HDFS and HBase for storage.
> > 2. Data processing and policy engine - Eagle allows users to create policies based on various metadata properties on HDFS, Hive and HBase data.
> > 3. Eagle services - Eagle services include policy manager, query service and the visualization component. Eagle provides an intuitive user interface to administer Eagle and an alert dashboard to respond to real time alerts.
> >
> > Data Collection and Storage:
> > Eagle provides a programming API for extending Eagle to integrate any data source into the Eagle policy evaluation framework. For example, Eagle HDFS audit monitoring collects data from Kafka, which is populated from a namenode log4j appender or from a logstash agent. Eagle Hive monitoring collects Hive query logs from running jobs through the YARN API, which is designed to be scalable and fault-tolerant. Eagle uses HBase as storage for metadata and metrics data, and also supports relational databases through a configuration change.
> >
> > Data Processing and Policy Engine:
> > Processing Engine: Eagle provides a stream processing API which is an abstraction of Apache Storm. It can also be extended to other streaming engines. This abstraction allows developers to assemble data transformation, filtering, external data join etc. without being physically bound to a specific streaming platform. The Eagle streaming API allows developers to easily integrate business logic with the Eagle policy engine. Internally, the Eagle framework compiles the business logic execution DAG into program primitives of the underlying stream infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring transforms audit logs from the Namenode into objects and joins sensitivity metadata and security zone metadata, which are generated by external programs or configured by the user. Eagle Hive monitoring filters running jobs to get the Hive query string, parses the query string into an object and then joins sensitivity metadata.
> > Alerting Framework: Eagle Alert Framework includes a stream metadata API, a scalable policy engine framework and an extensible policy engine framework. The stream metadata API allows developers to declare an event schema, including what attributes constitute an event, what the type of each attribute is, and how to dynamically resolve an attribute value at runtime when a user configures a policy. The scalable policy engine framework allows policies to be executed on different physical nodes in parallel. It also lets you define your own policy partitioner class. The policy engine framework, together with the streaming partitioning capability provided by all streaming platforms, will make sure policies and events can be evaluated in a fully distributed way. The extensible policy engine framework allows developers to plug in a new policy engine with a few lines of code. WSO2 Siddhi CEP engine is the policy engine which Eagle supports as a first-class citizen.
> > Machine Learning module: Eagle provides capabilities to define user activity patterns or user profiles for Hadoop users based on the user behaviour in the platform. These user profiles are modeled using Machine Learning algorithms and used for detection of anomalous user activities. Eagle uses Eigen Value Decomposition and Density Estimation algorithms for generating user profile models. The model reads data from HDFS audit logs, preprocesses and aggregates data, and generates models using Spark programming APIs. Once models are generated, Eagle uses the stream processing engine for near real-time anomaly detection to determine if any user’s activities are suspicious or not.
> >
> > Eagle Services:
> > Query Service: Eagle provides a SQL-like service API to support comprehensive computation on huge sets of data on the fly, e.g. comprehensive filtering, aggregation, histogram, sorting, top, arithmetical expression, pagination etc.
> > HBase is the data storage which Eagle supports as a first-class citizen; relational databases are supported as well. For HBase storage, the Eagle query framework compiles the user-provided SQL-like query into HBase native filter objects and executes it through an HBase coprocessor on the fly.
> > Policy Manager: Eagle policy manager provides a UI and a RESTful API for users to define a policy with just a few clicks. It includes site management UI, policy editor, sensitivity metadata import, HDFS or Hive sensitive resource browsing, alert dashboards etc.
> >
> > Background
> > Data is one of the most important assets for today’s businesses, which makes data security one of the top priorities of today’s enterprises. Hadoop is widely used across different verticals as a big data repository to store this data in most modern enterprises.
> > At eBay we use the hadoop platform extensively for our data processing needs. Our data in Hadoop is becoming bigger and bigger as our user base is seeing an exponential growth. Today there are a variety of data sets available in the Hadoop cluster for our users to consume. eBay has around 120 PB of data stored in HDFS across 6 different clusters and around 1800+ active hadoop users consuming data through Hive, HBase and mapreduce jobs every day to build applications using this data. With this astronomical growth of data there are also challenges in securing sensitive data and monitoring the access to this sensitive data. Today in large organizations HDFS is the de facto standard for storing big data. Data sets which include, but are not limited to, consumer sentiment, social media data, customer segmentation, web clicks, sensor data, geo-location and transaction data get stored in Hadoop for day to day business needs.
> > We at eBay want to make sure the sensitive data and data platforms are completely protected from security breaches.
> > So we partnered very closely with our Information Security team to understand the requirements for Eagle to monitor sensitive data access on hadoop:
> > 1. Ability to identify and stop security threats in real time
> > 2. Scale for big data (Support PB scale and Billions of events)
> > 3. Ability to create data access policies
> > 4. Support multiple data sources like HDFS, HBase, Hive
> > 5. Visualize alerts in real time
> > 6. Ability to block malicious access in real time
> > We did not find any data access monitoring solution that is available today and can provide the features and functionality that we need to monitor the data access in the hadoop ecosystem at our scale. Hence with an excellent team of world class developers and several users, we have been able to bring Eagle into production as well as open source it.
> >
> > Rationale
> > In today’s world, data is an important asset for any company. Businesses are using data extensively to create amazing experiences for users. Data has to be protected and access to data should be secured from security breaches. Today Hadoop is not only used to store logs but also stores financial data, sensitive data sets, geographical data, user click stream data sets etc., which makes it more important to be protected from security breaches. To secure a data platform there are multiple things that need to happen. One is having a strong access control mechanism, which today is provided by Apache Ranger and Apache Sentry. These tools provide a fine-grained access control mechanism for data sets on hadoop. But there is a big gap in terms of monitoring all the data access events and activities in order to secure the hadoop data platform. Together with strong access control, perimeter security and data access monitoring in place, data in the hadoop clusters can be secured against breaches.
> > We looked around and found the following:
> > Existing data activity monitoring products are designed for traditional databases and data warehouses. Existing monitoring platforms cannot scale out to support fast growing data and petabyte scale. A few products in the industry are still very early in terms of supporting HDFS, Hive, HBase data access monitoring.
> > As mentioned in the background, the business requirement and urgency to secure the data from users with malicious intent drove eBay to invest in building a real time data access monitoring solution from scratch to offer real time alerts and remediation features for malicious data access.
> > With the power of open source distributed systems like Hadoop, Kafka and much more we were able to develop a data activity monitoring system that can scale, identify and stop malicious access in real time.
> > Eagle allows admins to create standard access policies and rules for monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box machine learning models for modeling user profiles based on user access behaviour and uses the models to alert on anomalies.
> >
> > Current Status
> >
> > Meritocracy
> > Eagle has been deployed in production at eBay for monitoring billions of events per day from HDFS and Hive operations. From the start, the product has been built with a focus on high scalability and application extensibility, and Eagle has demonstrated great performance in responding to suspicious events instantly and great flexibility in defining policy.
> >
> > Community
> > Eagle seeks to develop the developer and user communities during incubation.
> >
> > Core Developers
> > Eagle is currently being designed and developed by engineers from eBay Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan.
> > All of these core developers have deep expertise in developing monitoring products for the Hadoop ecosystem.
> >
> > Alignment
> > The ASF is a natural host for Eagle given that it is already the home of Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects. Eagle leverages a lot of Apache open-source products. Eagle was designed to offer real time insights into sensitive data access by actively monitoring the data access on various data sets in hadoop and an extensible alerting framework with a powerful policy engine. Eagle complements the existing Hadoop platform area by providing a comprehensive monitoring and alerting solution for detecting sensitive data access threats based on preset policies and machine learning models for user behaviour analysis.
> >
> > Known Risks
> >
> > Orphaned Products
> > The core developers of the Eagle team work full time on this project. There is no risk of Eagle getting orphaned since eBay is extensively using it in its production Hadoop clusters and has plans to go beyond hadoop. For example, currently there are 7 hadoop clusters and 2 of them are being monitored using Hadoop Eagle in production. We have plans to extend it to all hadoop clusters and eventually other data platforms. There are tens of policies onboarded and actively monitored, with plans to onboard more use cases. We are very confident that every hadoop cluster in the world will be monitored using Eagle for securing the hadoop ecosystem by actively monitoring for data access on sensitive data. We plan to extend and diversify this community further through Apache. We presented Eagle at the Hadoop Summit in China and garnered interest from different companies who use hadoop extensively.
> >
> > Inexperience with Open Source
> > The core developers are all active users and followers of open source. They are already committers and contributors to the Eagle Github project.
> > All have been involved with the source code that has been released under an open source license, and several of them also have experience developing code in an open source environment. Though the core set of developers do not have Apache Open Source experience, there are plans to onboard individuals with Apache open source experience on to the project. Apache Kylin PMC members are also in the same eBay organization. We work very closely with Apache Ranger committers and are looking forward to finding meaningful integrations to improve the security of the hadoop platform.
> >
> > Homogenous Developers
> > The core developers are from eBay. Today the problem of monitoring data activities to find and stop threats is a universal problem faced by all businesses. The Apache Incubation process encourages an open and diverse meritocratic community. Eagle intends to make every possible effort to build a diverse, vibrant and involved community and has already received substantial interest from various organizations.
> >
> > Reliance on Salaried Developers
> > eBay invested in Eagle as the monitoring solution for Hadoop clusters and some of its key engineers are working full time on the project. In addition, since there is a growing need to secure sensitive data access and for a data activity monitoring solution for Hadoop, we look forward to other Apache developers and researchers contributing to the project. Additional contributors, including Apache committers, have plans to join this effort shortly. Also key to addressing the risk associated with relying on salaried developers from a single entity is to increase the diversity of the contributors and actively lobby for domain experts in the security space to contribute. Eagle intends to do this.
> >
> > Relationships with Other Apache Products
> > Eagle has a strong relationship and dependency with Apache Hadoop, HBase, Spark, Kafka and Storm.
> > Being part of Apache’s Incubation community could help with closer collaboration among these projects as well as others.
> >
> > An Excessive Fascination with the Apache Brand
> > Eagle is proposing to enter incubation at Apache in order to help efforts to diversify the committer-base, not so much to capitalize on the Apache brand. The Eagle project is in production use already inside eBay, but is not expected to be an eBay product for external customers. As such, the Eagle project is not seeking to use the Apache brand as a marketing tool.
> >
> > Documentation
> > Information about Eagle can be found at https://github.com/eBay/Eagle. The following link provides more information about Eagle: http://goeagle.io <http://goeagle.io/>.
> >
> > Initial Source
> > Eagle has been under development since 2014 by a team of engineers at eBay Inc. It is currently hosted on Github.com under an Apache license 2.0 at https://github.com/eBay/Eagle. Once in incubation we will be moving the code base to apache git library.
> >
> > External Dependencies
> > Eagle has the following external dependencies.
> > Basic
> > •JDK 1.7+
> > •Scala 2.10.4
> > •Apache Maven
> > •JUnit
> > •Log4j
> > •Slf4j
> > •Apache Commons
> > •Apache Commons Math3
> > •Jackson
> > •Siddhi CEP engine
> >
> > Hadoop
> > •Apache Hadoop
> > •Apache HBase
> > •Apache Hive
> > •Apache Zookeeper
> > •Apache Curator
> >
> > Apache Spark
> > •Spark Core Library
> >
> > REST Service
> > •Jersey
> >
> > Query
> > •Antlr
> >
> > Stream processing
> > •Apache Storm
> > •Apache Kafka
> >
> > Web
> > •AngularJS
> > •jQuery
> > •Bootstrap V3
> > •Moment JS
> > •Admin LTE
> > •html5shiv
> > •respond
> > •Fastclick
> > •Date Range Picker
> > •Flot JS
> >
> > Cryptography
> > Eagle will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Eagle to be a controlled export item due to the use of encryption. Eagle supports but does not require the Kerberos authentication mechanism to access secured Hadoop services.
> >
> > Required Resources
> >
> > Mailing List
> > •eagle-private for private PMC discussions
> > •eagle-dev for developers
> > •eagle-commits for all commits
> > •eagle-users for all eagle users
> >
> > Subversion Directory
> > •Git is the preferred source control system.
> >
> > Issue Tracking
> > •JIRA Eagle (Eagle)
> >
> > Other Resources
> > The existing code already has unit tests so we will make use of existing Apache continuous testing infrastructure. The resulting load should not be very large.
> >
> > Initial Committers
> > •Seshu Adunuthula
> > •Arun Manoharan
> > •Edward Zhang
> > •Hao Chen
> > •Chaitali Gupta
> > •Libin Sun
> > •Jilin Jiang
> > •Qingwen Zhao
> > •Hemanth Dendukuri
> > •Senthil Kumar
> >
> >
> > Affiliations
> > The initial committers are employees of eBay Inc.
> >
> > Sponsors
> >
> > Champion
> > •Henry Saputra - Apache IPMC member
> >
> > Nominated Mentors
> > •Owen O’Malley < omalley at apache dot org > - Apache IPMC member, Hortonworks
> > •Henry Saputra - Apache IPMC member
> > •Julian Hyde - Apache IPMC member, Hortonworks
> > •Amareshwari Sriramdasu - Apache IPMC member
> > •Taylor Goetz - Apache IPMC member, Hortonworks
> >
> > Sponsoring Entity
> > We are requesting the Incubator to sponsor this project.
> >
> >
> > ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone