From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "EagleProposal" by ArunManoharan
Date Mon, 19 Oct 2015 06:57:51 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "EagleProposal" page has been changed by ArunManoharan:
https://wiki.apache.org/incubator/EagleProposal?action=diff&rev1=4&rev2=5

  
  Eagle has three main parts.
  '''Data collection and storage''' - Eagle collects data from various Hadoop logs in real time using the Kafka and YARN APIs, and uses HDFS and HBase for storage (see the parsing sketch after this list).
+ 
  '''Data processing and policy engine''' - Eagle allows users to create policies based on various metadata properties of HDFS, Hive and HBase data.
+ 
  '''Eagle services''' - Eagle services include the policy manager, the query service and the visualization component. Eagle provides an intuitive user interface to administer Eagle and an alert dashboard for responding to real-time alerts.
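As an illustration of the data collection side, here is a minimal sketch, in plain Java, of turning a NameNode audit log entry into key/value fields before it is handed to the stream layer. The log layout is the stock HDFS audit format (allowed=/ugi=/cmd=/src=/...); the class name and the parsing shortcut are assumptions made for this sketch, not Eagle's actual collector code.

{{{
import java.util.HashMap;
import java.util.Map;

// Sketch only: Eagle's real collectors read these lines from log files or Kafka topics.
public class AuditLogParserSketch {

    // Example HDFS NameNode audit entry (stock Hadoop format):
    // 2015-10-19 06:57:51,000 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:KERBEROS)
    //     ip=/10.0.0.1 cmd=open src=/data/pii/users.csv dst=null perm=null
    public static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<String, String>();
        int start = line.indexOf("allowed=");
        if (start < 0) {
            return fields; // not an audit entry
        }
        for (String token : line.substring(start).split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        // e.g. {allowed=true, ugi=alice, ip=/10.0.0.1, cmd=open, src=/data/pii/users.csv, ...}
        return fields;
    }
}
}}}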
  
  === Eagle Architecture ===
@@ -25, +27 @@

  
  === Data Processing and Policy Engine: ===
  
- Processing Engine: Eagle provides stream processing API which is an abstraction of Apache
Storm. It can also be extended to other streaming engines. This abstraction allows developers
to assemble data transformation, filtering, external data join etc. without physically bound
to a specific streaming platform. Eagle streaming API allows developers to easily integrate
business logic with Eagle policy engine and internally Eagle framework compiles business logic
execution DAG into program primitives of underlying stream infrastructure e.g. Apache Storm.
For example, Eagle HDFS monitoring transforms audit log from Namenode to object and joins
sensitivity metadata, security zone metadata which are generated from external programs or
configured by user. Eagle hive monitoring filters running jobs to get hive query string and
parses query string into object and then joins sensitivity metadata.
+ '''Processing Engine:''' Eagle provides a stream processing API that is an abstraction over Apache Storm and can be extended to other streaming engines. This abstraction allows developers to assemble data transformation, filtering, external data joins, etc. without being physically bound to a specific streaming platform. The Eagle streaming API lets developers easily integrate business logic with the Eagle policy engine; internally, the Eagle framework compiles the business-logic execution DAG into the program primitives of the underlying streaming infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring transforms the audit log from the NameNode into objects and joins them with sensitivity metadata and security-zone metadata, which are generated by external programs or configured by the user. Eagle Hive monitoring filters running jobs to obtain the Hive query string, parses it into an object, and then joins it with sensitivity metadata.
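To make the abstraction concrete, below is a minimal sketch written directly against the pre-1.0 Apache Storm API (backtype.storm packages), i.e. the kind of primitive topology that an Eagle streaming DAG of transform/filter/join steps could compile into. The spout, the bolt and the hard-coded sensitivity check are illustrative assumptions, not Eagle's generated code.

{{{
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class HdfsAuditTopologySketch {

    // Hypothetical stand-in for a spout that tails NameNode audit logs (e.g. via Kafka)
    // and emits parsed (user, cmd, src) events; here it just replays one sample event.
    public static class SampleAuditSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("alice", "open", "/data/pii/users.csv"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("user", "cmd", "src"));
        }
    }

    // Keeps only events that touch paths marked sensitive; a real deployment would join
    // against sensitivity metadata instead of the hard-coded prefix used here.
    public static class SensitiveFilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String src = input.getStringByField("src");
            if (src != null && src.startsWith("/data/pii/")) {
                collector.emit(new Values(input.getStringByField("user"),
                                          input.getStringByField("cmd"), src));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("user", "cmd", "src"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("auditLog", new SampleAuditSpout(), 1);
        builder.setBolt("sensitiveFilter", new SensitiveFilterBolt(), 4)
               .shuffleGrouping("auditLog");
        // A policy-evaluation bolt (e.g. one wrapping a CEP engine) would subscribe here.

        new LocalCluster().submitTopology("hdfs-audit-sketch", new Config(), builder.createTopology());
    }
}
}}}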
  
- Alerting Framework: Eagle Alert Framework includes stream metadata API, scalable policy
engine framework, extensible policy engine framework. Stream metadata API allows developers
to declare event schema including what attributes constitute an event, what is the type for
each attribute, and how to dynamically resolve attribute value in runtime when user configures
policy. Scalable policy engine framework allows policies to be executed on different physical
nodes in parallel. It is also used to define your own policy partitioner class. Policy engine
framework together with streaming partitioning capability provided by all streaming platforms
will make sure policies and events can be evaluated in a fully distributed way.
+ '''Alerting Framework:''' The Eagle Alert Framework includes a stream metadata API, a scalable policy engine framework, and an extensible policy engine framework. The stream metadata API allows developers to declare an event schema, including which attributes constitute an event, the type of each attribute, and how to dynamically resolve attribute values at runtime when a user configures a policy. The scalable policy engine framework allows policies to be executed on different physical nodes in parallel, and lets developers define their own policy partitioner class. The policy engine framework, together with the stream partitioning capability provided by all streaming platforms, ensures that policies and events can be evaluated in a fully distributed way.
  
  The extensible policy engine framework allows developers to plug in a new policy engine with a few lines of code. The WSO2 Siddhi CEP engine is the policy engine that Eagle supports as a first-class citizen.
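With the Siddhi engine, a policy is essentially a CEP query over a declared event stream, and the `define stream` statement plays the role of the event schema described above. The sketch below uses the Siddhi 3.x Java API directly; the stream name, attributes and the sample policy are invented for illustration, and in Eagle the same query text would be managed through the policy framework rather than hand-wired like this.

{{{
import org.wso2.siddhi.core.ExecutionPlanRuntime;
import org.wso2.siddhi.core.SiddhiManager;
import org.wso2.siddhi.core.event.Event;
import org.wso2.siddhi.core.stream.input.InputHandler;
import org.wso2.siddhi.core.stream.output.StreamCallback;

public class SiddhiPolicySketch {
    public static void main(String[] args) throws InterruptedException {
        // Event schema plus one policy: alert when a user deletes a sensitive file.
        String executionPlan =
              "define stream hdfsAuditStream (user string, cmd string, src string, sensitivityType string); "
            + "@info(name = 'sensitiveDeletePolicy') "
            + "from hdfsAuditStream[cmd == 'delete' and sensitivityType == 'PII'] "
            + "select user, cmd, src insert into alertStream;";

        SiddhiManager siddhiManager = new SiddhiManager();
        ExecutionPlanRuntime runtime = siddhiManager.createExecutionPlanRuntime(executionPlan);

        // In Eagle this callback would hand the match to the alert/notification layer.
        runtime.addCallback("alertStream", new StreamCallback() {
            @Override
            public void receive(Event[] events) {
                for (Event e : events) {
                    System.out.println("ALERT: " + java.util.Arrays.toString(e.getData()));
                }
            }
        });

        runtime.start();
        InputHandler input = runtime.getInputHandler("hdfsAuditStream");
        input.send(new Object[]{"alice", "delete", "/data/pii/users.csv", "PII"}); // triggers the policy
        input.send(new Object[]{"bob", "open", "/tmp/scratch.txt", "NONE"});       // does not
        Thread.sleep(500);
        runtime.shutdown();
    }
}
}}}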
  
- Machine Learning module: Eagle provides capabilities to define user activity patterns or
user profiles for Hadoop users based on the user behaviour in the platform. These user profiles
are modeled using Machine Learning algorithms and used for detection of anomalous users activities.
Eagle uses Eigen Value Decomposition, and Density Estimation algorithms for generating user
profile models. The model reads data from HDFS audit logs, preprocesses and aggregates data,
and generates models using Spark programming APIs. Once models are generated, Eagle uses stream
processing engine for near real-time anomaly detection to determine if any user’s activities
are suspicious or not.
+ '''Machine Learning module:''' Eagle provides capabilities to define user activity patterns, or user profiles, for Hadoop users based on their behaviour on the platform. These user profiles are modeled using machine learning algorithms and used to detect anomalous user activity. Eagle uses Eigenvalue Decomposition and Density Estimation algorithms to generate user profile models. The modeling pipeline reads data from HDFS audit logs, preprocesses and aggregates the data, and generates models using the Spark programming APIs. Once models are generated, Eagle uses the stream processing engine for near-real-time anomaly detection to determine whether a user's activity is suspicious.
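As a rough illustration of the batch side of this pipeline, the sketch below uses the Spark Java API to aggregate per-user operation counts from HDFS audit logs; a real profile model (eigenvalue decomposition or density estimation) would then be fitted on such aggregated features. The paths, the field extraction and the feature choice are assumptions made for the sketch, not Eagle's actual user-profile job.

{{{
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class UserProfileFeatureSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("user-profile-feature-sketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Audit log lines copied into HDFS; path and format are assumptions for the sketch.
        JavaRDD<String> auditLines = sc.textFile("hdfs:///logs/hdfs-audit/2015-10-19/*");

        // Count operations per (user, cmd) pair -- a toy stand-in for the aggregated
        // features a real profile model would consume.
        JavaPairRDD<String, Long> opCounts = auditLines
            .mapToPair(new PairFunction<String, String, Long>() {
                @Override
                public Tuple2<String, Long> call(String line) {
                    String user = extract(line, "ugi=");
                    String cmd = extract(line, "cmd=");
                    return new Tuple2<String, Long>(user + "\t" + cmd, 1L);
                }
            })
            .reduceByKey(new Function2<Long, Long, Long>() {
                @Override
                public Long call(Long a, Long b) {
                    return a + b;
                }
            });

        opCounts.saveAsTextFile("hdfs:///eagle/userprofile/features/2015-10-19");
        sc.stop();
    }

    // Pulls the token following a "key=" marker out of an audit log line.
    private static String extract(String line, String key) {
        int start = line.indexOf(key);
        if (start < 0) {
            return "unknown";
        }
        start += key.length();
        int end = line.indexOf(' ', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }
}
}}}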
  
  ==== Eagle Services: ====
  
  Policy Manager: The Eagle policy manager provides a UI and a RESTful API for users to define policies with just a few clicks. It includes a site management UI, a policy editor, sensitivity metadata import, HDFS and Hive sensitive-resource browsing, alert dashboards, etc.
  
  Query Service: Eagle provides a SQL-like service API to support comprehensive computation over huge data sets on the fly, e.g. filtering, aggregation, histograms, sorting, top-N, arithmetic expressions, pagination, etc. HBase is the data store that Eagle supports as a first-class citizen; relational databases are supported as well. For HBase storage, the Eagle query framework compiles the user-provided SQL-like query into native HBase filter objects and executes it through an HBase coprocessor on the fly.
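For the HBase path, the sketch below shows the kind of native filter objects such a SQL-like query could be compiled into, using only the plain HBase client API. The table layout, column names and the example predicate are invented for illustration, and Eagle additionally pushes the evaluation into a coprocessor so that aggregation happens next to the data.

{{{
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class QueryCompileSketch {
    public static void main(String[] args) {
        // Roughly the scan a SQL-like query such as
        //   "select * from alerts where user = 'alice' and severity = 'HIGH' limit 100"
        // could be compiled into. Family/qualifier names are made up for the sketch.
        byte[] family = Bytes.toBytes("f");

        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new SingleColumnValueFilter(
                family, Bytes.toBytes("user"), CompareOp.EQUAL, Bytes.toBytes("alice")));
        filters.addFilter(new SingleColumnValueFilter(
                family, Bytes.toBytes("severity"), CompareOp.EQUAL, Bytes.toBytes("HIGH")));
        filters.addFilter(new PageFilter(100)); // "limit 100"

        Scan scan = new Scan();
        scan.addFamily(family);
        scan.setFilter(filters);
        scan.setCaching(1000); // rows fetched per RPC

        // A plain client would now call table.getScanner(scan); Eagle instead runs the
        // compiled scan server-side through its coprocessor.
        System.out.println(scan);
    }
}
}}}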
+ 
- Background  
+ == Background ==
  
  Data is one of the most important assets for today's businesses, which makes data security one of the top priorities of today's enterprises. Hadoop is widely used across different verticals as a big data repository to store this data in most modern enterprises.
  
  At eBay we use the Hadoop platform extensively for our data processing needs. Our data in Hadoop keeps growing as our user base grows exponentially. Today there is a variety of data sets available in our Hadoop clusters for users to consume. eBay has around 120 PB of data stored in HDFS across 6 different clusters and 1,800+ active Hadoop users consuming data through Hive, HBase and MapReduce jobs every day to build applications using this data. With this astronomical growth of data come challenges in securing sensitive data and monitoring access to it. Today, in large organizations, HDFS is the de facto standard for storing big data. Data sets including, but not limited to, consumer sentiment, social media data, customer segmentation, web clicks, sensor data, geo-location and transaction data are stored in Hadoop for day-to-day business needs.
  
- We at eBay want to make sure the sensitive data and data platforms are completely 	protected
from security breaches. So we partnered very closely with our Information Security team to
understand the requirements for Eagle to monitor sensitive data access on hadoop: 
+ We at eBay want to make sure our sensitive data and data platforms are completely protected from security breaches, so we partnered closely with our Information Security team to understand the requirements for Eagle to monitor sensitive data access on Hadoop:
  
- Ability to identify and stop security threats in real time
+ * Ability to identify and stop security threats in real time
- Scale for big data (Support PB scale and Billions of events)
+ * Scale for big data (Support PB scale and Billions of events)
- Ability to create data access policies 
+ * Ability to create data access policies 
- Support multiple data sources like HDFS, HBase, Hive
+ * Support multiple data sources like HDFS, HBase, Hive
- Visualize alerts in real time
+ * Visualize alerts in real time
- Ability to block malicious access in real time
+ * Ability to block malicious access in real time
  
  We did not find any data access monitoring solution available today that can provide the features and functionality we need to monitor data access in the Hadoop ecosystem at our scale. Hence, with an excellent team of world-class developers and several users, we have been able to bring Eagle into production as well as open source it.
  
- === Rationale ===
+ == Rationale ==
  In today's world, data is an important asset for any company. Businesses use data extensively to create great experiences for users, so data has to be protected and access to it secured against breaches. Today Hadoop is used not only to store logs but also financial data, sensitive data sets, geographical data, user click-stream data sets, etc., which makes protecting it from security breaches even more important. Securing a data platform requires multiple things. One is a strong access control mechanism, which today is provided by Apache Ranger and Apache Sentry; these tools offer fine-grained access control to data sets on Hadoop. But there is a big gap in monitoring all the data access events and activities needed to secure the Hadoop data platform. With strong access control, perimeter security and data access monitoring in place, data in Hadoop clusters can be secured against breaches. We looked around and found the following:
   
  Existing data activity monitoring products are designed for traditional databases and data warehouses.
@@ -79, +82 @@

  === Core Developers ===
  Eagle is currently being designed and developed by engineers from eBay Inc. – Edward Zhang,
Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri,
Arun Manoharan. All of these core developers have deep expertise in developing monitoring
products for the Hadoop ecosystem.
  
- === Alignment === 
+ == Alignment ==
  The ASF is a natural host for Eagle given that it is already the home of Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects. Eagle leverages many Apache open-source products. Eagle was designed to offer real-time insight into sensitive data access by actively monitoring data access across various data sets in Hadoop, with an extensible alerting framework and a powerful policy engine. Eagle complements the existing Hadoop platform area by providing a comprehensive monitoring and alerting solution for detecting sensitive data access threats based on preset policies and machine learning models for user behaviour analysis.

  
- === Known Risks === 
+ == Known Risks == 
  
- ==== Orphaned Products ====
+ === Orphaned Products ===
  
  The core developers of the Eagle team work full time on this project. There is no risk of Eagle getting orphaned since eBay uses it extensively in its production Hadoop clusters and has plans to go beyond Hadoop. For example, there are currently 7 Hadoop clusters, 2 of which are monitored using Eagle in production. We have plans to extend it to all Hadoop clusters and eventually other data platforms. There are tens of policies onboarded and actively monitored, with plans to onboard more use cases. We are very confident that every Hadoop cluster in the world will be monitored using Eagle to secure the Hadoop ecosystem by actively monitoring access to sensitive data. We plan to extend and diversify this community further through Apache. We presented Eagle at the Hadoop Summit in China and garnered interest from different companies that use Hadoop extensively.
  
- ==== Inexperience with Open Source ====
+ === Inexperience with Open Source ===
  The core developers are all active users and followers of open source. They are already committers and contributors to the Eagle GitHub project. All have been involved with source code released under an open-source license, and several of them also have experience developing code in an open-source environment. Though the core set of developers does not have Apache open-source experience, there are plans to onboard individuals with Apache experience onto the project. Apache Kylin PMC members are also in the same eBay organization. We work very closely with Apache Ranger committers and look forward to finding meaningful integrations to improve the security of the Hadoop platform.
  
- ==== Homogenous Developers ====
+ === Homogenous Developers ===
  The core developers are from eBay. Today, monitoring data activity to find and stop threats is a universal problem faced by all businesses. The Apache incubation process encourages an open and diverse meritocratic community. Eagle intends to make every possible effort to build a diverse, vibrant and involved community, and has already received substantial interest from various organizations.
  
- ==== Reliance on Salaried Developers ====
+ === Reliance on Salaried Developers ===
  eBay invested in Eagle as the monitoring solution for its Hadoop clusters, and some of its key engineers are working full time on the project. In addition, since there is a growing need to secure sensitive data access with a data activity monitoring solution for Hadoop, we look forward to other Apache developers and researchers contributing to the project. Additional contributors, including Apache committers, plan to join this effort shortly. Key to addressing the risk associated with relying on salaried developers from a single entity is increasing the diversity of contributors and actively lobbying for domain experts in the security space to contribute; Eagle intends to do this.
+ 
- ==== Relationships with Other Apache Products ====
+ === Relationships with Other Apache Products ===
  Eagle has a strong relationship with, and dependency on, Apache Hadoop, HBase, Spark, Kafka and Storm. Being part of Apache's incubation community could help foster closer collaboration among these projects as well as others.
  === An Excessive Fascination with the Apache Brand ===
  Eagle is proposing to enter incubation at Apache in order to help efforts to diversify the committer base, not so much to capitalize on the Apache brand. The Eagle project is already in production use inside eBay, but is not expected to be an eBay product for external customers. As such, the Eagle project is not seeking to use the Apache brand as a marketing tool.
  
- === Documentation ===
+ == Documentation ==
  Information about Eagle can be found at https://github.com/eBay/Eagle. The following links provide more information about Eagle in open source. For more information, please also refer to http://goeagle.io
  
- === Initial Source ===
+ == Initial Source ==
  Eagle has been under development since 2014 by a team of engineers at eBay Inc. It is currently hosted on GitHub under the Apache License 2.0 at https://github.com/eBay/Eagle. Once in incubation, we will move the code base to the Apache Git repositories.
  
- === External Dependencies ===
+ == External Dependencies ==
  Eagle has the following external dependencies.
- Basic
+ ===== Basic =====
  * JDK 1.7+
  * Scala 2.10.4
  * Apache Maven
@@ -120, +124 @@

  * Apache Commons Math3
  * Jackson
  * Siddhi CEP engine
- Hadoop
+ 
+ ===== Hadoop =====
  * Apache Hadoop
  * Apache HBase
  * Apache Hive
  * Apache Zookeeper
  * Apache Curator
  
- Apache Spark
+ ===== Apache Spark =====
- Spark Core Library
+ * Spark Core Library
- REST Service
+ ===== REST Service =====
- Jersey
+ * Jersey
- Query
+ ===== Query =====
- Antlr
+ * Antlr
- Stream processing
+ ===== Stream processing =====
- Apache Storm
+ * Apache Storm
- Apache Kafka
+ * Apache Kafka
- Web
+ ===== Web =====
- AngularJS
+ * AngularJS
- jQuery
+ * jQuery
- Bootstrap V3
+ * Bootstrap V3
- Moment JS
+ * Moment JS
- Admin LTE
+ * Admin LTE
- html5shiv
+ * html5shiv
- respond
+ * respond
- Fastclick
+ * Fastclick
- Date Range Picker
+ * Date Range Picker
- Flot JS
+ * Flot JS
  
- ==== Cryptography ====
+ == Cryptography ==
  Eagle will eventually support encryption on the wire. This is not one of the initial goals,
and we do not expect Eagle to be a controlled export item due to the use of encryption. Eagle
supports but does not require the Kerberos authentication mechanism to access secured Hadoop
services.
  
- === Required Resources ===
+ == Required Resources ==
- ==== Mailing List ====
+ === Mailing List ===
  * eagle-private for private PMC discussions
  * eagle-dev for developers
  * eagle-commits for all commits
  * eagle-users for all eagle users
  
- ==== Subversion Directory ====
+ === Subversion Directory ===
  * Git is the preferred source control system.
+ 
- *Issue Tracking
+ === Issue Tracking ===
- JIRA Eagle (Eagle)
+ * JIRA Eagle (Eagle)
+ 
- *Other Resources
+ === Other Resources ===
  The existing code already has unit tests, so we will make use of the existing Apache continuous testing infrastructure. The resulting load should not be very large.
  
- === Initial Committers ===
+ == Initial Committers ==
- Seshu Adunuthula <sadunuthula at ebay dot com>
+ * Seshu Adunuthula <sadunuthula at ebay dot com>
- Arun Manoharan <armanoharan at ebay dot com>
+ * Arun Manoharan <armanoharan at ebay dot com>
- Edward Zhang <yonzhang at ebay dot com>
+ * Edward Zhang <yonzhang at ebay dot com>
- Hao Chen <hchen9 at ebay dot com>
+ * Hao Chen <hchen9 at ebay dot com>
- Chaitali Gupta <cgupta at ebay dot com>
+ * Chaitali Gupta <cgupta at ebay dot com>
- Libin Sun <libsun at ebay dot com>
+ * Libin Sun <libsun at ebay dot com>
- Jilin Jiang <jiljiang at ebay dot com>
+ * Jilin Jiang <jiljiang at ebay dot com>
- Qingwen Zhao <qingwzhao at ebay dot com>
+ * Qingwen Zhao <qingwzhao at ebay dot com>
- Hemanth Dendukuri <hdendukuri at ebay dot com>
+ * Hemanth Dendukuri <hdendukuri at ebay dot com>
- Senthil Kumar <senthilkumar at ebay dot com>
+ * Senthil Kumar <senthilkumar at ebay dot com>
- Tan Chen <tanchen at ebay dot com>
+ * Tan Chen <tanchen at ebay dot com>
  
- === Affiliations ===
+ == Affiliations ==
  The initial committers are employees of eBay Inc. 
  
- ==== Sponsors ==== 
+ == Sponsors ==
- ==== Champion ====
+ === Champion ===
  Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
  
- ==== Nominated Mentors ====
+ === Nominated Mentors ===
- Owen O’Malley < omalley at apache dot org > - Apache IPMC member, Hortonworks
+ * Owen O’Malley < omalley at apache dot org > - Apache IPMC member, Hortonworks
- Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
+ * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
- Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
+ * Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
  
- ==== Sponsoring Entity ====
+ === Sponsoring Entity ===
  We are requesting the Incubator to sponsor this project. 
  
