incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Manoharan, Arun" <armanoha...@ebay.com>
Subject Re: [DISCUSS] Eagle incubator proposal
Date Tue, 20 Oct 2015 17:09:58 GMT
Hi Amareshwari,

Thank you very much. We definitely like to have you as a mentor.

I will edit the proposal and set up sometime with you.

Thanks,
Arun

On 10/20/15, 1:36 AM, "Amareshwari Sriramdasu" <amareshwari@apache.org>
wrote:

>I would like to volunteer as mentor and help the project, if you are
>looking for more mentors.
>
>Thanks
>Amareshwari
>
>On Mon, Oct 19, 2015 at 9:03 PM, Manoharan, Arun <armanoharan@ebay.com>
>wrote:
>
>> Hello Everyone,
>>
>> My name is Arun Manoharan. Currently a product manager in the Analytics
>> platform team at eBay Inc.
>>
>> I would like to start a discussion on Eagle and its joining the ASF as
>>an
>> incubation project.
>>
>> Eagle is a Monitoring solution for Hadoop to instantly identify access
>>to
>> sensitive data, recognize attacks, malicious activities and take
>>actions in
>> real time. Eagle supports a wide variety of policies on HDFS data and
>>Hive.
>> Eagle also provides machine learning models for detecting anomalous user
>> behavior in Hadoop.
>>
>> The proposal is available on the wiki here:
>> https://wiki.apache.org/incubator/EagleProposal
>>
>> The text of the proposal is also available at the end of this email.
>>
>> Thanks for your time and help.
>>
>> Thanks,
>> Arun
>>
>> <COPY of the proposal in text format>
>>
>> Eagle
>>
>> Abstract
>> Eagle is an Open Source Monitoring solution for Hadoop to instantly
>> identify access to sensitive data, recognize attacks, malicious
>>activities
>> in hadoop and take actions.
>>
>> Proposal
>> Eagle audits access to HDFS files, Hive and HBase tables in real time,
>> enforces policies defined on sensitive data access and alerts or blocks
>> user¹s access to that sensitive data in real time. Eagle also creates
>>user
>> profiles based on the typical access behaviour for HDFS and Hive and
>>sends
>> alerts when anomalous behaviour is detected. Eagle can also import
>> sensitive data information classified by external classification
>>engines to
>> help define its policies.
>>
>> Overview of Eagle
>> Eagle has 3 main parts.
>> 1.Data collection and storage - Eagle collects data from various hadoop
>> logs in real time using Kafka/Yarn API and uses HDFS and HBase for
>>storage.
>> 2.Data processing and policy engine - Eagle allows users to create
>> policies based on various metadata properties on HDFS, Hive and HBase
>>data.
>> 3.Eagle services - Eagle services include policy manager, query service
>> and the visualization component. Eagle provides intuitive user
>>interface to
>> administer Eagle and an alert dashboard to respond to real time alerts.
>>
>> Data Collection and Storage:
>> Eagle provides programming API for extending Eagle to integrate any data
>> source into Eagle policy evaluation framework. For example, Eagle hdfs
>> audit monitoring collects data from Kafka which is populated from
>>namenode
>> log4j appender or from logstash agent. Eagle hive monitoring collects
>>hive
>> query logs from running job through YARN API, which is designed to be
>> scalable and fault-tolerant. Eagle uses HBase as storage for storing
>> metadata and metrics data, and also supports relational database through
>> configuration change.
>>
>> Data Processing and Policy Engine:
>> Processing Engine: Eagle provides stream processing API which is an
>> abstraction of Apache Storm. It can also be extended to other streaming
>> engines. This abstraction allows developers to assemble data
>> transformation, filtering, external data join etc. without physically
>>bound
>> to a specific streaming platform. Eagle streaming API allows developers
>>to
>> easily integrate business logic with Eagle policy engine and internally
>> Eagle framework compiles business logic execution DAG into program
>> primitives of underlying stream infrastructure e.g. Apache Storm. For
>> example, Eagle HDFS monitoring transforms audit log from Namenode to
>>object
>> and joins sensitivity metadata, security zone metadata which are
>>generated
>> from external programs or configured by user. Eagle hive monitoring
>>filters
>> running jobs to get hive query string and parses query string into
>>object
>> and then joins sensitivity metadata.
>> Alerting Framework: Eagle Alert Framework includes stream metadata API,
>> scalable policy engine framework, extensible policy engine framework.
>> Stream metadata API allows developers to declare event schema including
>> what attributes constitute an event, what is the type for each
>>attribute,
>> and how to dynamically resolve attribute value in runtime when user
>> configures policy. Scalable policy engine framework allows policies to
>>be
>> executed on different physical nodes in parallel. It is also used to
>>define
>> your own policy partitioner class. Policy engine framework together with
>> streaming partitioning capability provided by all streaming platforms
>>will
>> make sure policies and events can be evaluated in a fully distributed
>>way.
>> Extensible policy engine framework allows developer to plugin a new
>>policy
>> engine with a few lines of codes. WSO2 Siddhi CEP engine is the policy
>> engine which Eagle supports as first-class citizen.
>> Machine Learning module: Eagle provides capabilities to define user
>> activity patterns or user profiles for Hadoop users based on the user
>> behaviour in the platform. These user profiles are modeled using Machine
>> Learning algorithms and used for detection of anomalous users
>>activities.
>> Eagle uses Eigen Value Decomposition, and Density Estimation algorithms
>>for
>> generating user profile models. The model reads data from HDFS audit
>>logs,
>> preprocesses and aggregates data, and generates models using Spark
>> programming APIs. Once models are generated, Eagle uses stream
>>processing
>> engine for near real-time anomaly detection to determine if any user¹s
>> activities are suspicious or not.
>>
>> Eagle Services:
>> Query Service: Eagle provides SQL-like service API to support
>> comprehensive computation for huge set of data on the fly, for e.g.
>> comprehensive filtering, aggregation, histogram, sorting, top,
>>arithmetical
>> expression, pagination etc. HBase is the data storage which Eagle
>>supports
>> as first-class citizen, relational database is supported as well. For
>>HBase
>> storage, Eagle query framework compiles user provided SQL-like query
>>into
>> HBase native filter objects and execute it through HBase coprocessor on
>>the
>> fly.
>> Policy Manager: Eagle policy manager provides UI and Restful API for
>>user
>> to define policy with just a few clicks. It includes site management UI,
>> policy editor, sensitivity metadata import, HDFS or Hive sensitive
>>resource
>> browsing, alert dashboards etc.
>> Background
>> Data is one of the most important assets for today¹s businesses, which
>> makes data security one of the top priorities of today¹s enterprises.
>> Hadoop is widely used across different verticals as a big data
>>repository
>> to store this data in most modern enterprises.
>> At eBay we use hadoop platform extensively for our data processing
>>needs.
>> Our data in Hadoop is becoming bigger and bigger as our user base is
>>seeing
>> an exponential growth. Today there are variety of data sets available in
>> Hadoop cluster for our users to consume. eBay has around 120 PB of data
>> stored in HDFS across 6 different clusters and around 1800+ active
>>hadoop
>> users consuming data thru Hive, HBase and mapreduce jobs everyday to
>>build
>> applications using this data. With this astronomical growth of data
>>there
>> are also challenges in securing sensitive data and monitoring the
>>access to
>> this sensitive data. Today in large organizations HDFS is the defacto
>> standard for storing big data. Data sets which includes and not limited
>>to
>> consumer sentiment, social media data, customer segmentation, web
>>clicks,
>> sensor data, geo-location and transaction data get stored in Hadoop for
>>day
>> to day business needs.
>> We at eBay want to make sure the sensitive data and data platforms are
>> completely protected from security breaches. So we partnered very
>>closely
>> with our Information Security team to understand the requirements for
>>Eagle
>> to monitor sensitive data access on hadoop:
>> 1.Ability to identify and stop security threats in real time
>> 2.Scale for big data (Support PB scale and Billions of events)
>> 3.Ability to create data access policies
>> 4.Support multiple data sources like HDFS, HBase, Hive
>> 5.Visualize alerts in real time
>> 6.Ability to block malicious access in real time
>> We did not find any data access monitoring solution that available today
>> and can provide the features and functionality that we need to monitor
>>the
>> data access in the hadoop ecosystem at our scale. Hence with an
>>excellent
>> team of world class developers and several users, we have been able to
>> bring Eagle into production as well as open source it.
>>
>> Rationale
>> In today¹s world; data is an important asset for any company. Businesses
>> are using data extensively to create amazing experiences for users. Data
>> has to be protected and access to data should be secured from security
>> breaches. Today Hadoop is not only used to store logs but also stores
>> financial data, sensitive data sets, geographical data, user click
>>stream
>> data sets etc. which makes it more important to be protected from
>>security
>> breaches. To secure a data platform there are multiple things that need
>>to
>> happen. One is having a strong access control mechanism which today is
>> provided by Apache Ranger and Apache Sentry. These tools provide the
>> ability to provide fine grain access control mechanism to data sets on
>> hadoop. But there is a big gap in terms of monitoring all the data
>>access
>> events and activities in order to securing the hadoop data platform.
>> Together with strong access control, perimeter security and data access
>> monitoring in place data in the hadoop clusters can be secured against
>> breaches. We looked around and found following:
>> Existing data activity monitoring products are designed for traditional
>> databases and data warehouse. Existing monitoring platforms cannot scale
>> out to support fast growing data and petabyte scale. Few products in the
>> industry are still very early in terms of supporting HDFS, Hive, HBase
>>data
>> access monitoring.
>> As mentioned in the background, the business requirement and urgency to
>> secure the data from users with malicious intent drove eBay to invest in
>> building a real time data access monitoring solution from scratch to
>>offer
>> real time alerts and remediation features for malicious data access.
>> With the power of open source distributed systems like Hadoop, Kafka and
>> much more we were able to develop a data activity monitoring system that
>> can scale, identify and stop malicious access in real time.
>> Eagle allows admins to create standard access policies and rules for
>> monitoring HDFS, Hive and HBase data. Eagle also provides out of box
>> machine learning models for modeling user profiles based on user access
>> behaviour and use the model to alert on anomalies.
>>
>> Current Status
>>
>> Meritocracy
>> Eagle has been deployed in production at eBay for monitoring billions of
>> events per day from HDFS and Hive operations. From the start; the
>>product
>> has been built with focus on high scalability and application
>>extensibility
>> in mind and Eagle has demonstrated great performance in responding to
>> suspicious events instantly and great flexibility in defining policy.
>>
>> Community
>> Eagle seeks to develop the developer and user communities during
>> incubation.
>>
>> Core Developers
>> Eagle is currently being designed and developed by engineers from eBay
>> Inc. ­ Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
>> Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
>> these core developers have deep expertise in developing monitoring
>>products
>> for the Hadoop ecosystem.
>>
>> Alignment
>> The ASF is a natural host for Eagle given that it is already the home of
>> Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
>> projects. Eagle leverages lot of Apache open-source products. Eagle was
>> designed to offer real time insights into sensitive data access by
>>actively
>> monitoring the data access on various data sets in hadoop and an
>>extensible
>> alerting framework with a powerful policy engine. Eagle compliments the
>> existing Hadoop platform area by providing a comprehensive monitoring
>>and
>> alerting solution for detecting sensitive data access threats based on
>> preset policies and machine learning models for user behaviour analysis.
>>
>> Known Risks
>>
>> Orphaned Products
>> The core developers of Eagle team work full time on this project. There
>>is
>> no risk of Eagle getting orphaned since eBay is extensively using it in
>> their production Hadoop clusters and have plans to go beyond hadoop. For
>> example, currently there are 7 hadoop clusters and 2 of them are being
>> monitored using Hadoop Eagle in production. We have plans to extend it
>>to
>> all hadoop clusters and eventually other data platforms. There are 10¹s
>>of
>> policies onboarded and actively monitored with plans to onboard more use
>> case. We are very confident that every hadoop cluster in the world will
>>be
>> monitored using Eagle for securing the hadoop ecosystem by actively
>> monitoring for data access on sensitive data. We plan to extend and
>> diversify this community further through Apache. We presented Eagle at
>>the
>> hadoop summit in china and garnered interest from different companies
>>who
>> use hadoop extensively.
>>
>> Inexperience with Open Source
>> The core developers are all active users and followers of open source.
>> They are already committers and contributors to the Eagle Github
>>project.
>> All have been involved with the source code that has been released
>>under an
>> open source license, and several of them also have experience developing
>> code in an open source environment. Though the core set of Developers do
>> not have Apache Open Source experience, there are plans to onboard
>> individuals with Apache open source experience on to the project. Apache
>> Kylin PMC members are also in the same ebay organization. We work very
>> closely with Apache Ranger committers and are looking forward to find
>> meaningful integrations to improve the security of hadoop platform.
>>
>> Homogenous Developers
>> The core developers are from eBay. Today the problem of monitoring data
>> activities to find and stop threats is a universal problem faced by all
>>the
>> businesses. Apache Incubation process encourages an open and diverse
>> meritocratic community. Eagle intends to make every possible effort to
>> build a diverse, vibrant and involved community and has already received
>> substantial interest from various organizations.
>>
>> Reliance on Salaried Developers
>> eBay invested in Eagle as the monitoring solution for Hadoop clusters
>>and
>> some of its key engineers are working full time on the project. In
>> addition, since there is a growing need for securing sensitive data
>>access
>> we need a data activity monitoring solution for Hadoop, we look forward
>>to
>> other Apache developers and researchers to contribute to the project.
>> Additional contributors, including Apache committers have plans to join
>> this effort shortly. Also key to addressing the risk associated with
>> relying on Salaried developers from a single entity is to increase the
>> diversity of the contributors and actively lobby for Domain experts in
>>the
>> security space to contribute. Eagle intends to do this.
>>
>> Relationships with Other Apache Products
>> Eagle has a strong relationship and dependency with Apache Hadoop,
>>HBase,
>> Spark, Kafka and Storm. Being part of Apache¹s Incubation community,
>>could
>> help with a closer collaboration among these projects and as well as
>> others. An Excessive Fascination with the Apache Brand Eagle is
>>proposing
>> to enter incubation at Apache in order to help efforts to diversify the
>> committer-base, not so much to capitalize on the Apache brand. The Eagle
>> project is in production use already inside eBay, but is not expected
>>to be
>> an eBay product for external customers. As such, the Eagle project is
>>not
>> seeking to use the Apache brand as a marketing tool.
>>
>> Documentation
>> Information about Eagle can be found at https://github.com/eBay/Eagle.
>> The following link provide more information about Eagle
>>http://goeagle.io.
>>
>> Initial Source
>> Eagle has been under development since 2014 by a team of engineers at
>>eBay
>> Inc. It is currently hosted on Github.com under an Apache license 2.0 at
>> https://github.com/eBay/Eagle. Once in incubation we will be moving the
>> code base to apache git library.
>>
>> External Dependencies
>> Eagle has the following external dependencies.
>> Basic
>> €JDK 1.7+
>> €Scala 2.10.4
>> €Apache Maven
>> €JUnit
>> €Log4j
>> €Slf4j
>> €Apache Commons
>> €Apache Commons Math3
>> €Jackson
>> €Siddhi CEP engine
>>
>> Hadoop
>> €Apache Hadoop
>> €Apache HBase
>> €Apache Hive
>> €Apache Zookeeper
>> €Apache Curator
>>
>> Apache Spark
>> €Spark Core Library
>>
>> REST Service
>> €Jersey
>>
>> Query
>> €Antlr
>>
>> Stream processing
>> €Apache Storm
>> €Apache Kafka
>>
>> Web
>> €AngularJS
>> €jQuery
>> €Bootstrap V3
>> €Moment JS
>> €Admin LTE
>> €html5shiv
>> €respond
>> €Fastclick
>> €Date Range Picker
>> €Flot JS
>>
>> Cryptography
>> Eagle will eventually support encryption on the wire. This is not one of
>> the initial goals, and we do not expect Eagle to be a controlled export
>> item due to the use of encryption. Eagle supports but does not require
>>the
>> Kerberos authentication mechanism to access secured Hadoop services.
>>
>> Required Resources
>>
>> Mailing List
>> €eagle-private for private PMC discussions
>> €eagle-dev for developers
>> €eagle-commits for all commits
>> €eagle-users for all eagle users
>>
>> Subversion Directory
>> €Git is the preferred source control system.
>>
>> Issue Tracking
>> €JIRA Eagle (Eagle)
>>
>> Other Resources
>> The existing code already has unit tests so we will make use of existing
>> Apache continuous testing infrastructure. The resulting load should not
>>be
>> very large.
>>
>> Initial Committers
>> €Seshu Adunuthula <sadunuthula at ebay dot com>
>> €Arun Manoharan <armanoharan at ebay dot com>
>> €Edward Zhang <yonzhang at ebay dot com>
>> €Hao Chen <hchen9 at ebay dot com>
>> €Chaitali Gupta <cgupta at ebay dot com>
>> €Libin Sun <libsun at ebay dot com>
>> €Jilin Jiang <jiljiang at ebay dot com>
>> €Qingwen Zhao <qingwzhao at ebay dot com>
>> €Hemanth Dendukuri <hdendukuri at ebay dot com>
>> €Senthil Kumar <senthilkumar at ebay dot com>
>> €Tan Chen <tanchen at ebay dot com>
>>
>> Affiliations
>> The initial committers are employees of eBay Inc.
>>
>> Sponsors
>>
>> Champion
>> €Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>
>> Nominated Mentors
>> €Owen O¹Malley < omalley at apache dot org > - Apache IPMC member,
>> Hortonworks
>> €Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> €Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
>> Hortonworks
>>
>> Sponsoring Entity
>> We are requesting the Incubator to sponsor this project.
>>
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Mime
View raw message