incubator-general mailing list archives

From Jean-Baptiste Onofré ...@nanthrax.net>
Subject Re: [DISCUSS] Eagle incubator proposal
Date Tue, 20 Oct 2015 05:46:17 GMT
It makes sense. I will try to contribute on this ;)

Regards
JB

On 10/19/2015 09:46 PM, Zhang, Edward (GDI Hadoop) wrote:
> Hi JB,
>
> That is a good point. Good to know that Falcon feeds HDFS/Hive/HBase data
> changes, so this feature would complement Eagle, which today mainly focuses
> on HDFS/Hive/HBase data access, including view, change, delete, etc. Eagle
> would benefit if it could instantly capture data changes from Falcon.
>
> Thanks
> Edward Zhang
>
>
>
> On 10/19/15, 8:40, "Jean-Baptiste Onofré" <jb@nanthrax.net> wrote:
>
>> Hi Arun,
>>
>> Very interesting proposal. I may see some possible interaction with
>> Falcon. In Falcon, we have HDFS files (and Hive/HBase) monitoring (with
>> a kind of Change Data Capture), etc.
>>
>> So, I see a different perspective in Eagle, but Eagle could also
>> leverage Falcon somehow.
>>
>> Regards
>> JB
>>
>> On 10/19/2015 05:33 PM, Manoharan, Arun wrote:
>>> Hello Everyone,
>>>
>>> My name is Arun Manoharan. I am currently a product manager on the
>>> Analytics platform team at eBay Inc.
>>>
>>> I would like to start a discussion on Eagle and its joining the ASF as
>>> an incubation project.
>>>
>>> Eagle is a monitoring solution for Hadoop to instantly identify access
>>> to sensitive data, recognize attacks and malicious activities, and take
>>> action in real time. Eagle supports a wide variety of policies on HDFS
>>> data and Hive. Eagle also provides machine learning models for detecting
>>> anomalous user behavior in Hadoop.
>>>
>>> The proposal is available on the wiki here:
>>> https://wiki.apache.org/incubator/EagleProposal
>>>
>>> The text of the proposal is also available at the end of this email.
>>>
>>> Thanks for your time and help.
>>>
>>> Thanks,
>>> Arun
>>>
>>> <COPY of the proposal in text format>
>>>
>>> Eagle
>>>
>>> Abstract
>>> Eagle is an open source monitoring solution for Hadoop to instantly
>>> identify access to sensitive data, recognize attacks and malicious
>>> activities in Hadoop, and take action.
>>>
>>> Proposal
>>> Eagle audits access to HDFS files, Hive and HBase tables in real time,
>>> enforces policies defined on sensitive data access and alerts on or blocks
>>> a user's access to that sensitive data in real time. Eagle also creates
>>> user profiles based on the typical access behaviour for HDFS and Hive
>>> and sends alerts when anomalous behaviour is detected. Eagle can also
>>> import sensitive data information classified by external classification
>>> engines to help define its policies.
>>>
>>> Overview of Eagle
>>> Eagle has 3 main parts.
>>> 1. Data collection and storage - Eagle collects data from various Hadoop
>>> logs in real time using Kafka and the YARN API and uses HDFS and HBase
>>> for storage.
>>> 2. Data processing and policy engine - Eagle allows users to create
>>> policies based on various metadata properties on HDFS, Hive and HBase
>>> data.
>>> 3. Eagle services - Eagle services include the policy manager, the query
>>> service and the visualization component. Eagle provides an intuitive user
>>> interface to administer Eagle and an alert dashboard to respond to real
>>> time alerts.
>>>
>>> Data Collection and Storage:
>>> Eagle provides a programming API for extending Eagle to integrate any
>>> data source into the Eagle policy evaluation framework. For example, Eagle
>>> HDFS audit monitoring collects data from Kafka, which is populated from
>>> a NameNode log4j appender or from a Logstash agent. Eagle Hive monitoring
>>> collects Hive query logs from running jobs through the YARN API, which is
>>> designed to be scalable and fault-tolerant. Eagle uses HBase as storage
>>> for metadata and metrics data, and also supports relational databases
>>> through a configuration change.
>>>
>>> Data Processing and Policy Engine:
>>> Processing Engine: Eagle provides a stream processing API which is an
>>> abstraction over Apache Storm. It can also be extended to other streaming
>>> engines. This abstraction allows developers to assemble data
>>> transformation, filtering, external data joins etc. without being
>>> bound to a specific streaming platform. The Eagle streaming API allows
>>> developers to easily integrate business logic with the Eagle policy
>>> engine; internally, the Eagle framework compiles the business logic
>>> execution DAG into program primitives of the underlying stream
>>> infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring
>>> transforms each audit log entry from the NameNode into an object and joins
>>> it with sensitivity metadata and security zone metadata, which are
>>> generated by external programs or configured by the user. Eagle Hive
>>> monitoring filters running jobs to get the Hive query string, parses the
>>> query string into an object and then joins it with sensitivity metadata.
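
The transform-and-join step described above can be sketched as follows. This is only an illustration, not Eagle's streaming API: the `AuditEvent` class, the `SENSITIVITY` table, and both function names are hypothetical, and the sample line assumes the usual key=value layout of an HDFS NameNode audit entry.

```python
import re
from dataclasses import dataclass

# Audit entries carry key=value pairs; values end at the next whitespace.
AUDIT_FIELD = re.compile(r"(\w+)=(\S+)")


@dataclass
class AuditEvent:
    user: str
    cmd: str
    src: str
    allowed: bool
    sensitivity: str  # joined from metadata; "NONE" when nothing matches


# Hypothetical sensitivity metadata: path prefix -> sensitivity tag.
SENSITIVITY = {
    "/data/finance": "FINANCIAL",
    "/data/users/pii": "PII",
}


def lookup_sensitivity(path, metadata):
    """Return the tag of the longest matching path prefix, or "NONE"."""
    best = ""
    for prefix in metadata:
        is_match = path == prefix or path.startswith(prefix + "/")
        if is_match and len(prefix) > len(best):
            best = prefix
    return metadata.get(best, "NONE")


def parse_audit_line(line, metadata):
    """Turn one raw audit line into an object and join sensitivity metadata."""
    fields = dict(AUDIT_FIELD.findall(line))
    src = fields.get("src", "")
    return AuditEvent(
        user=fields.get("ugi", "unknown"),
        cmd=fields.get("cmd", "unknown"),
        src=src,
        allowed=fields.get("allowed") == "true",
        sensitivity=lookup_sensitivity(src, metadata),
    )


line = ("2015-10-19 17:33:00,123 INFO FSNamesystem.audit: allowed=true\t"
        "ugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\t"
        "src=/data/finance/q3.csv\tdst=null\tperm=null")
event = parse_audit_line(line, SENSITIVITY)
```

In a real deployment the same two steps would run as stages of a streaming topology rather than as plain function calls.
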
>>> Alerting Framework: The Eagle Alert Framework includes a stream metadata
>>> API, a scalable policy engine framework and an extensible policy engine
>>> framework. The stream metadata API allows developers to declare an event
>>> schema, including what attributes constitute an event, what the type of
>>> each attribute is, and how to dynamically resolve attribute values at
>>> runtime when a user configures a policy. The scalable policy engine
>>> framework allows policies to be executed on different physical nodes in
>>> parallel. It also lets developers define their own policy partitioner
>>> class. The policy engine framework, together with the stream partitioning
>>> capability provided by all streaming platforms, makes sure policies and
>>> events can be evaluated in a fully distributed way. The extensible policy
>>> engine framework allows developers to plug in a new policy engine with a
>>> few lines of code. The WSO2 Siddhi CEP engine is the policy engine which
>>> Eagle supports as a first-class citizen.
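
A minimal sketch of the two ideas in this paragraph, a pluggable policy evaluator and a policy partitioner, might look like the code below. All class and function names here are hypothetical and are not taken from Eagle or Siddhi; the partitioner simply hashes the policy id, which is one possible strategy among many.

```python
import zlib


class PolicyEvaluator:
    """Hypothetical plug-in interface: one evaluator per policy."""

    def matches(self, event):
        raise NotImplementedError


class PredicateEvaluator(PolicyEvaluator):
    """Simplest possible engine: wraps a plain Python predicate."""

    def __init__(self, predicate):
        self.predicate = predicate

    def matches(self, event):
        return self.predicate(event)


def partition(policy_id, num_partitions):
    """Map a policy to a partition; crc32 is stable across processes."""
    return zlib.crc32(policy_id.encode("utf-8")) % num_partitions


def assign(policies, num_partitions):
    """Distribute policies over partitions (each would live on one node)."""
    buckets = [dict() for _ in range(num_partitions)]
    for policy_id, evaluator in policies.items():
        buckets[partition(policy_id, num_partitions)][policy_id] = evaluator
    return buckets


def evaluate(buckets, event):
    """Evaluate an event against every partition and collect alerts."""
    alerts = []
    for bucket in buckets:  # sequential here; parallel across nodes in practice
        for policy_id, evaluator in bucket.items():
            if evaluator.matches(event):
                alerts.append(policy_id)
    return sorted(alerts)


policies = {
    "pii-delete": PredicateEvaluator(
        lambda e: e["sensitivity"] == "PII" and e["cmd"] == "delete"),
    "finance-read": PredicateEvaluator(
        lambda e: e["sensitivity"] == "FINANCIAL" and e["cmd"] == "open"),
}
buckets = assign(policies, num_partitions=3)
alerts = evaluate(buckets, {"sensitivity": "PII", "cmd": "delete"})
```

Swapping in another engine only requires a new `PolicyEvaluator` subclass; the partitioning code does not change.
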
>>> Machine Learning module: Eagle provides capabilities to define user
>>> activity patterns or user profiles for Hadoop users based on their
>>> behaviour on the platform. These user profiles are modeled using machine
>>> learning algorithms and used for detection of anomalous user
>>> activities. Eagle uses eigenvalue decomposition and density estimation
>>> algorithms for generating user profile models. The module reads data from
>>> HDFS audit logs, preprocesses and aggregates the data, and generates
>>> models using Spark programming APIs. Once models are generated, Eagle uses
>>> the stream processing engine for near real-time anomaly detection to
>>> determine whether any user's activities are suspicious.
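
To make the density estimation idea concrete, here is a small sketch under simplifying assumptions: it uses a one-dimensional Gaussian kernel density estimate in plain Python instead of Spark, and the history values, bandwidth and threshold are invented for illustration. It is not Eagle's model, only an instance of the general technique.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the probability density at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def is_anomalous(density, x, threshold):
    """Flag an observation whose estimated density falls below threshold."""
    return density(x) < threshold


# Hypothetical profile: files read per hour by one user over past hours.
history = [40, 42, 38, 45, 41, 39, 44, 43]
density = gaussian_kde(history, bandwidth=3.0)
threshold = 1e-4  # assumed cut-off; would be tuned per deployment

typical = is_anomalous(density, 41, threshold)
burst = is_anomalous(density, 400, threshold)
```

Here `typical` is False because 41 reads per hour sits inside the user's usual range, while `burst` is True because 400 reads per hour has near-zero estimated density.
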
>>>
>>> Eagle Services:
>>> Query Service: Eagle provides a SQL-like service API to support
>>> comprehensive computation over huge data sets on the fly, e.g.
>>> comprehensive filtering, aggregation, histogram, sorting, top,
>>> arithmetical expression, pagination etc. HBase is the data storage which
>>> Eagle supports as a first-class citizen; relational databases are
>>> supported as well. For HBase storage, the Eagle query framework compiles
>>> the user-provided SQL-like query into HBase native filter objects and
>>> executes it through an HBase coprocessor on the fly.
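
The idea of compiling a query string into filter objects can be illustrated with a toy example. The query syntax below is invented for this sketch and is not Eagle's query language; it supports only `=` and `!=` conditions joined by `AND`, and it produces a Python predicate rather than HBase filters.

```python
import re

# One condition: field, operator, double-quoted value.
CONDITION = re.compile(r'\s*(\w+)\s*(=|!=)\s*"([^"]*)"\s*')


def compile_filter(expr):
    """Compile 'a = "x" AND b != "y"' into a predicate over dict rows.

    Limitation of this sketch: values must not contain the word AND.
    """
    checks = []
    for part in re.split(r"\s+AND\s+", expr.strip()):
        match = CONDITION.fullmatch(part)
        if match is None:
            raise ValueError("unsupported condition: " + part)
        checks.append(match.groups())

    def predicate(row):
        # Equality must hold for "=" and must not hold for "!=".
        return all(
            (row.get(field) == value) == (op == "=")
            for field, op, value in checks
        )

    return predicate


rows = [
    {"user": "alice", "cmd": "open"},
    {"user": "alice", "cmd": "delete"},
    {"user": "bob", "cmd": "delete"},
]
only_alice_non_open = compile_filter('user = "alice" AND cmd != "open"')
selected = [row for row in rows if only_alice_non_open(row)]
```

Compiling once and reusing the predicate mirrors the approach of building filter objects up front and pushing them to where the data lives.
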
>>> Policy Manager: The Eagle policy manager provides a UI and a RESTful API
>>> for users to define policies with just a few clicks. It includes a site
>>> management UI, policy editor, sensitivity metadata import, HDFS or Hive
>>> sensitive resource browsing, alert dashboards etc.
>>>
>>> Background
>>> Data is one of the most important assets for today's businesses, which
>>> makes data security one of the top priorities of today's enterprises.
>>> Hadoop is widely used across different verticals as a big data
>>> repository to store this data in most modern enterprises.
>>> At eBay we use the Hadoop platform extensively for our data processing
>>> needs. Our data in Hadoop is becoming bigger and bigger as our user base
>>> is seeing exponential growth. Today there is a variety of data sets
>>> available in the Hadoop cluster for our users to consume. eBay has around
>>> 120 PB of data stored in HDFS across 6 different clusters and
>>> 1800+ active Hadoop users consuming data through Hive, HBase and MapReduce
>>> jobs every day to build applications using this data. With this
>>> astronomical growth of data there are also challenges in securing
>>> sensitive data and monitoring access to it. Today in large organizations
>>> HDFS is the de facto standard for storing big data. Data sets including,
>>> but not limited to, consumer sentiment, social media data, customer
>>> segmentation, web clicks, sensor data, geo-location and transaction data
>>> get stored in Hadoop for day to day business needs.
>>> We at eBay want to make sure the sensitive data and data platforms are
>>> completely protected from security breaches. So we partnered very
>>> closely with our Information Security team to understand the
>>> requirements for Eagle to monitor sensitive data access on Hadoop:
>>> 1. Ability to identify and stop security threats in real time
>>> 2. Scale for big data (support PB scale and billions of events)
>>> 3. Ability to create data access policies
>>> 4. Support multiple data sources like HDFS, HBase, Hive
>>> 5. Visualize alerts in real time
>>> 6. Ability to block malicious access in real time
>>> We did not find any data access monitoring solution available today
>>> that can provide the features and functionality we need to
>>> monitor data access in the Hadoop ecosystem at our scale. Hence with
>>> an excellent team of world class developers and several users, we have
>>> been able to bring Eagle into production as well as open source it.
>>>
>>> Rationale
>>> In today's world, data is an important asset for any company.
>>> Businesses are using data extensively to create amazing experiences for
>>> users. Data has to be protected and access to data should be secured
>>> from security breaches. Today Hadoop is not only used to store logs but
>>> also stores financial data, sensitive data sets, geographical data, user
>>> click stream data sets etc., which makes it more important to protect
>>> it from security breaches. To secure a data platform there are
>>> multiple things that need to happen. One is having a strong access
>>> control mechanism, which today is provided by Apache Ranger and Apache
>>> Sentry. These tools provide fine-grained access control for data sets
>>> on Hadoop. But there is a big gap in terms of monitoring all the data
>>> access events and activities in order to secure the Hadoop data
>>> platform. With strong access control, perimeter security and data
>>> access monitoring in place, data in the Hadoop clusters can be secured
>>> against breaches. We looked around and found the following:
>>> Existing data activity monitoring products are designed for traditional
>>> databases and data warehouses. Existing monitoring platforms cannot scale
>>> out to support fast growing data and petabyte scale. The few products in
>>> the industry that support HDFS, Hive, HBase data access monitoring are
>>> still at a very early stage.
>>> As mentioned in the background, the business requirement and urgency to
>>> secure the data from users with malicious intent drove eBay to invest in
>>> building a real time data access monitoring solution from scratch to
>>> offer real time alerts and remediation features for malicious data
>>> access.
>>> With the power of open source distributed systems like Hadoop, Kafka
>>> and much more we were able to develop a data activity monitoring system
>>> that can scale, identify and stop malicious access in real time.
>>> Eagle allows admins to create standard access policies and rules for
>>> monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
>>> machine learning models for modeling user profiles based on user access
>>> behaviour and uses the models to alert on anomalies.
>>>
>>> Current Status
>>>
>>> Meritocracy
>>> Eagle has been deployed in production at eBay for monitoring billions
>>> of events per day from HDFS and Hive operations. From the start, the
>>> product has been built with a focus on high scalability and application
>>> extensibility, and Eagle has demonstrated great performance in
>>> responding to suspicious events instantly and great flexibility in
>>> defining policy.
>>>
>>> Community
>>> Eagle seeks to develop the developer and user communities during
>>> incubation.
>>>
>>> Core Developers
>>> Eagle is currently being designed and developed by engineers from eBay
>>> Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
>>> Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
>>> these core developers have deep expertise in developing monitoring
>>> products for the Hadoop ecosystem.
>>>
>>> Alignment
>>> The ASF is a natural host for Eagle given that it is already the home
>>> of Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
>>> projects. Eagle leverages a lot of Apache open-source products. Eagle was
>>> designed to offer real time insights into sensitive data access by
>>> actively monitoring the data access on various data sets in Hadoop, and
>>> to provide an extensible alerting framework with a powerful policy
>>> engine. Eagle complements the existing Hadoop platform area by providing
>>> a comprehensive monitoring and alerting solution for detecting sensitive
>>> data access threats based on preset policies and machine learning models
>>> for user behaviour analysis.
>>>
>>> Known Risks
>>>
>>> Orphaned Products
>>> The core developers of the Eagle team work full time on this project.
>>> There is no risk of Eagle getting orphaned since eBay is extensively using
>>> it in its production Hadoop clusters and has plans to go beyond Hadoop.
>>> For example, currently there are 7 Hadoop clusters and 2 of them are
>>> being monitored using Hadoop Eagle in production. We have plans to
>>> extend it to all Hadoop clusters and eventually other data platforms.
>>> There are tens of policies onboarded and actively monitored, with plans
>>> to onboard more use cases. We are very confident that every Hadoop
>>> cluster in the world will be monitored using Eagle for securing the
>>> Hadoop ecosystem by actively monitoring for data access on sensitive
>>> data. We plan to extend and diversify this community further through
>>> Apache. We presented Eagle at the Hadoop Summit in China and garnered
>>> interest from different companies who use Hadoop extensively.
>>>
>>> Inexperience with Open Source
>>> The core developers are all active users and followers of open source.
>>> They are already committers and contributors to the Eagle GitHub
>>> project. All have been involved with the source code that has been
>>> released under an open source license, and several of them also have
>>> experience developing code in an open source environment. Though the
>>> core set of developers do not have Apache open source experience, there
>>> are plans to onboard individuals with Apache open source experience on
>>> to the project. Apache Kylin PMC members are also in the same eBay
>>> organization. We work very closely with Apache Ranger committers and are
>>> looking forward to finding meaningful integrations to improve the
>>> security of the Hadoop platform.
>>>
>>> Homogenous Developers
>>> The core developers are from eBay. Today the problem of monitoring data
>>> activities to find and stop threats is a universal problem faced by all
>>> the businesses. Apache Incubation process encourages an open and diverse
>>> meritocratic community. Eagle intends to make every possible effort to
>>> build a diverse, vibrant and involved community and has already received
>>> substantial interest from various organizations.
>>>
>>> Reliance on Salaried Developers
>>> eBay invested in Eagle as the monitoring solution for Hadoop clusters
>>> and some of its key engineers are working full time on the project. In
>>> addition, since there is a growing need for securing sensitive data
>>> access and for a data activity monitoring solution for Hadoop, we look
>>> forward to other Apache developers and researchers contributing to the
>>> project. Additional contributors, including Apache committers, have plans
>>> to join this effort shortly. Also key to addressing the risk associated
>>> with relying on salaried developers from a single entity is to increase
>>> the diversity of the contributors and actively lobby for domain experts
>>> in the security space to contribute. Eagle intends to do this.
>>>
>>> Relationships with Other Apache Products
>>> Eagle has a strong relationship and dependency with Apache Hadoop,
>>> HBase, Spark, Kafka and Storm. Being part of Apache's Incubation
>>> community could help with a closer collaboration among these projects
>>> as well as others.
>>>
>>> An Excessive Fascination with the Apache Brand
>>> Eagle is proposing to enter incubation at Apache in order to help
>>> efforts to diversify the committer-base, not so much to capitalize on
>>> the Apache brand. The Eagle project is in production use already inside
>>> eBay, but is not expected to be an eBay product for external customers.
>>> As such, the Eagle project is not seeking to use the Apache brand as a
>>> marketing tool.
>>>
>>> Documentation
>>> Information about Eagle can be found at https://github.com/eBay/Eagle.
>>> The following link provides more information about Eagle:
>>> http://goeagle.io.
>>>
>>> Initial Source
>>> Eagle has been under development since 2014 by a team of engineers at
>>> eBay Inc. It is currently hosted on GitHub under the Apache License
>>> 2.0 at https://github.com/eBay/Eagle. Once in incubation we will be
>>> moving the code base to an Apache Git repository.
>>>
>>> External Dependencies
>>> Eagle has the following external dependencies.
>>> Basic
>>> - JDK 1.7+
>>> - Scala 2.10.4
>>> - Apache Maven
>>> - JUnit
>>> - Log4j
>>> - Slf4j
>>> - Apache Commons
>>> - Apache Commons Math3
>>> - Jackson
>>> - Siddhi CEP engine
>>>
>>> Hadoop
>>> - Apache Hadoop
>>> - Apache HBase
>>> - Apache Hive
>>> - Apache Zookeeper
>>> - Apache Curator
>>>
>>> Apache Spark
>>> - Spark Core Library
>>>
>>> REST Service
>>> - Jersey
>>>
>>> Query
>>> - Antlr
>>>
>>> Stream processing
>>> - Apache Storm
>>> - Apache Kafka
>>>
>>> Web
>>> - AngularJS
>>> - jQuery
>>> - Bootstrap V3
>>> - Moment JS
>>> - Admin LTE
>>> - html5shiv
>>> - respond
>>> - Fastclick
>>> - Date Range Picker
>>> - Flot JS
>>>
>>> Cryptography
>>> Eagle will eventually support encryption on the wire. This is not one
>>> of the initial goals, and we do not expect Eagle to be a controlled
>>> export item due to the use of encryption. Eagle supports but does not
>>> require the Kerberos authentication mechanism to access secured Hadoop
>>> services.
>>>
>>> Required Resources
>>>
>>> Mailing List
>>> - eagle-private for private PMC discussions
>>> - eagle-dev for developers
>>> - eagle-commits for all commits
>>> - eagle-users for all eagle users
>>>
>>> Subversion Directory
>>> - Git is the preferred source control system.
>>>
>>> Issue Tracking
>>> - JIRA Eagle (Eagle)
>>>
>>> Other Resources
>>> The existing code already has unit tests so we will make use of
>>> existing Apache continuous testing infrastructure. The resulting load
>>> should not be very large.
>>>
>>> Initial Committers
>>> - Seshu Adunuthula <sadunuthula at ebay dot com>
>>> - Arun Manoharan <armanoharan at ebay dot com>
>>> - Edward Zhang <yonzhang at ebay dot com>
>>> - Hao Chen <hchen9 at ebay dot com>
>>> - Chaitali Gupta <cgupta at ebay dot com>
>>> - Libin Sun <libsun at ebay dot com>
>>> - Jilin Jiang <jiljiang at ebay dot com>
>>> - Qingwen Zhao <qingwzhao at ebay dot com>
>>> - Hemanth Dendukuri <hdendukuri at ebay dot com>
>>> - Senthil Kumar <senthilkumar at ebay dot com>
>>> - Tan Chen <tanchen at ebay dot com>
>>>
>>> Affiliations
>>> The initial committers are employees of eBay Inc.
>>>
>>> Sponsors
>>>
>>> Champion
>>> - Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>>
>>> Nominated Mentors
>>> - Owen O'Malley <omalley at apache dot org> - Apache IPMC member,
>>> Hortonworks
>>> - Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>> - Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
>>> Hortonworks
>>>
>>> Sponsoring Entity
>>> We are requesting the Incubator to sponsor this project.
>>>
>>>
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
