incubator-general mailing list archives

From "Manoharan, Arun" <>
Subject Re: [DISCUSS] Eagle incubator proposal
Date Wed, 21 Oct 2015 16:47:19 GMT
Hi Sid,

Thanks for your support.

Actually we have developed an Ambari plugin for Eagle where someone could
use Ambari to deploy Eagle. We have this working on the sandbox. Would
like to have you as a contributor. I will reach out to you.


On 10/21/15, 9:22 AM, "Siddharth Wagle" <> wrote:

>Hi Arun,
>This proposal looks great. I would like to be an active contributor on
>this project. I bring with me the experience of Apache Ambari and
>developing the Ambari Metrics System.
>Best Regards,
>From: Julian Hyde <>
>Sent: Wednesday, October 21, 2015 9:10 AM
>Subject: Re: [DISCUSS] Eagle incubator proposal
>My name is already on the list of mentors. I think this project fills an
>important need. Several of the initial committers were involved with
>Kylin and therefore know the Apache process.
>> On Oct 20, 2015, at 11:58 AM, P. Taylor Goetz <> wrote:
>> I should also have some improved bandwidth both now that Kylin is
>>nearing graduation and for other reasons. I've been bogged down
>>recently, but that's starting to change.
>> If more mentors are desired, I'd be willing to help in that respect.
>> -Taylor
>>> On Oct 20, 2015, at 11:49 AM, Henry Saputra <>
>>> Hi Ted,
>>> Since Kylin is almost ready to graduate, I have more bandwidth to help
>>>with Eagle.
>>> But, you are right that current proposed mentors for Eagle seemed to
>>> be very busy with other podlings, so 1 or 2 additional mentors would
>>> be great.
>>> The good news is that the team includes some people from Kylin, for
>>> example Luke, who has done a great job helping Kylin understand working
>>> the Apache way.
>>> So we have some help from initial committers who have done the rodeo
>>> - Henry
>>> On Mon, Oct 19, 2015 at 9:00 AM, Ted Dunning <>
>>>> I would suggest that Owen O'Malley has not had enough time to be a
>>>> mentor recently and should not be on the list of mentors.
>>>> Henry and Julian are good if their schedules permit.  Henry, I know
>>>> been mentoring a number of projects lately.
>>>> On Mon, Oct 19, 2015 at 8:40 AM, Jean-Baptiste Onofré
>>>> wrote:
>>>>> Hi Arun,
>>>>> very interesting proposal. I may see some possible interaction with
>>>>> Falcon. In Falcon, we have HDFS files (and Hive/HBase) monitoring
>>>>>(with a
>>>>> kind of Change Data Capture), etc.
>>>>> So, I see a different perspective in Eagle, but Eagle could also
>>>>> Falcon somehow.
>>>>> Regards
>>>>> JB
>>>>> On 10/19/2015 05:33 PM, Manoharan, Arun wrote:
>>>>>> Hello Everyone,
>>>>>> My name is Arun Manoharan. Currently a product manager in the
>>>>>> platform team at eBay Inc.
>>>>>> I would like to start a discussion on Eagle and its joining the ASF
>>>>>>as an
>>>>>> incubation project.
>>>>>> Eagle is a Monitoring solution for Hadoop to instantly identify
>>>>>>access to
>>>>>> sensitive data, recognize attacks, malicious activities and take
>>>>>>actions in
>>>>>> real time. Eagle supports a wide variety of policies on HDFS data
>>>>>>and Hive.
>>>>>> Eagle also provides machine learning models for detecting anomalous
>>>>>> behavior in Hadoop.
>>>>>> The proposal is available on the wiki here:
>>>>>> The text of the proposal is also available at the end of this email.
>>>>>> Thanks for your time and help.
>>>>>> Thanks,
>>>>>> Arun
>>>>>> <COPY of the proposal in text format>
>>>>>> Eagle
>>>>>> Abstract
>>>>>> Eagle is an Open Source Monitoring solution for Hadoop to instantly
>>>>>> identify access to sensitive data, recognize attacks, malicious activities
>>>>>> in hadoop and take actions.
>>>>>> Proposal
>>>>>> Eagle audits access to HDFS files, Hive and HBase tables in real
>>>>>> enforces policies defined on sensitive data access and alerts or
>>>>>> user's access to that sensitive data in real time. Eagle also
>>>>>>creates user
>>>>>> profiles based on the typical access behaviour for HDFS and Hive
>>>>>>and sends
>>>>>> alerts when anomalous behaviour is detected. Eagle can also import
>>>>>> sensitive data information classified by external classification
>>>>>>engines to
>>>>>> help define its policies.
>>>>>> Overview of Eagle
>>>>>> Eagle has 3 main parts.
>>>>>> 1. Data collection and storage - Eagle collects data from various
>>>>>> logs in real time using Kafka/Yarn API and uses HDFS and HBase for
>>>>>> 2. Data processing and policy engine - Eagle allows users to create
>>>>>> policies based on various metadata properties on HDFS, Hive and
>>>>>>HBase data.
>>>>>> 3. Eagle services - Eagle services include policy manager, query
>>>>>> and the visualization component. Eagle provides intuitive user
>>>>>>interface to
>>>>>> administer Eagle and an alert dashboard to respond to real time
>>>>>> Data Collection and Storage:
>>>>>> Eagle provides programming API for extending Eagle to integrate any
>>>>>> source into Eagle policy evaluation framework. For example, Eagle
>>>>>> audit monitoring collects data from Kafka which is populated from
>>>>>> log4j appender or from logstash agent. Eagle hive monitoring
>>>>>>collects hive
>>>>>> query logs from running job through YARN API, which is designed to
>>>>>> scalable and fault-tolerant. Eagle uses HBase as storage for storing
>>>>>> metadata and metrics data, and also supports relational database
>>>>>> configuration change.
>>>>>> Data Processing and Policy Engine:
>>>>>> Processing Engine: Eagle provides stream processing API which is
>>>>>> abstraction of Apache Storm. It can also be extended to other
>>>>>> engines. This abstraction allows developers to assemble data
>>>>>> transformation, filtering, external data join etc. without
>>>>>>physically bound
>>>>>> to a specific streaming platform. Eagle streaming API allows
>>>>>>developers to
>>>>>> easily integrate business logic with Eagle policy engine and
>>>>>> Eagle framework compiles business logic execution DAG into program
>>>>>> primitives of underlying stream infrastructure e.g. Apache Storm.
>>>>>> example, Eagle HDFS monitoring transforms audit log from Namenode
>>>>>>to object
>>>>>> and joins sensitivity metadata, security zone metadata which are
>>>>>> from external programs or configured by user. Eagle hive monitoring
>>>>>> running jobs to get hive query string and parses query string into
>>>>>> and then joins sensitivity metadata.
>>>>>> Alerting Framework: Eagle Alert Framework includes stream metadata
>>>>>> scalable policy engine framework, extensible policy engine
>>>>>> Stream metadata API allows developers to declare event schema
>>>>>> what attributes constitute an event, what is the type for each
>>>>>> and how to dynamically resolve attribute value in runtime when user
>>>>>> configures policy. Scalable policy engine framework allows policies
>>>>>>to be
>>>>>> executed on different physical nodes in parallel. It is also used
>>>>>>to define
>>>>>> your own policy partitioner class. Policy engine framework together
>>>>>> streaming partitioning capability provided by all streaming
>>>>>>platforms will
>>>>>> make sure policies and events can be evaluated in a fully
>>>>>>distributed way.
>>>>>> Extensible policy engine framework allows developers to plug in a new
>>>>>> engine with a few lines of code. WSO2 Siddhi CEP engine is the
>>>>>> engine which Eagle supports as first-class citizen.
>>>>>> Machine Learning module: Eagle provides capabilities to define user
>>>>>> activity patterns or user profiles for Hadoop users based on the
>>>>>> behaviour in the platform. These user profiles are modeled using
>>>>>> Learning algorithms and used for detection of anomalous users
>>>>>> Eagle uses Eigen Value Decomposition, and Density Estimation
>>>>>>algorithms for
>>>>>> generating user profile models. The model reads data from HDFS
>>>>>>audit logs,
>>>>>> preprocesses and aggregates data, and generates models using Spark
>>>>>> programming APIs. Once models are generated, Eagle uses stream
>>>>>> engine for near real-time anomaly detection to determine if any
>>>>>> activities are suspicious or not.
>>>>>> Eagle Services:
>>>>>> Query Service: Eagle provides SQL-like service API to support
>>>>>> comprehensive computation for huge set of data on the fly, for e.g.
>>>>>> comprehensive filtering, aggregation, histogram, sorting, top,
>>>>>> expression, pagination etc. HBase is the data storage which Eagle
>>>>>> as first-class citizen, relational database is supported as well.
>>>>>>For HBase
>>>>>> storage, Eagle query framework compiles user provided SQL-like
>>>>>>query into
>>>>>> HBase native filter objects and execute it through HBase
>>>>>>coprocessor on the
>>>>>> fly.
>>>>>> Policy Manager: Eagle policy manager provides UI and Restful API
>>>>>>for user
>>>>>> to define policy with just a few clicks. It includes site
>>>>>>management UI,
>>>>>> policy editor, sensitivity metadata import, HDFS or Hive sensitive
>>>>>> browsing, alert dashboards etc.
>>>>>> Background
>>>>>> Data is one of the most important assets for today's businesses,
>>>>>> makes data security one of the top priorities of today's
>>>>>> Hadoop is widely used across different verticals as a big data
>>>>>> to store this data in most modern enterprises.
>>>>>> At eBay we use hadoop platform extensively for our data processing
>>>>>> Our data in Hadoop is becoming bigger and bigger as our user base
>>>>>>is seeing
>>>>>> an exponential growth. Today there are a variety of data sets
>>>>>>available in
>>>>>> Hadoop cluster for our users to consume. eBay has around 120 PB of
>>>>>> stored in HDFS across 6 different clusters and around 1800+ active
>>>>>> users consuming data through Hive, HBase and MapReduce jobs every day
>>>>>>to build
>>>>>> applications using this data. With this astronomical growth of data
>>>>>> are also challenges in securing sensitive data and monitoring the
>>>>>>access to
>>>>>> this sensitive data. Today in large organizations HDFS is the
>>>>>> standard for storing big data. Data sets which include but are not
>>>>>>limited to
>>>>>> consumer sentiment, social media data, customer segmentation, web
>>>>>> sensor data, geo-location and transaction data get stored in Hadoop
>>>>>>for day
>>>>>> to day business needs.
>>>>>> We at eBay want to make sure the sensitive data and data platforms
>>>>>> completely protected from security breaches. So we partnered very
>>>>>> with our Information Security team to understand the requirements
>>>>>>for Eagle
>>>>>> to monitor sensitive data access on hadoop:
>>>>>> 1. Ability to identify and stop security threats in real time
>>>>>> 2. Scale for big data (Support PB scale and Billions of events)
>>>>>> 3. Ability to create data access policies
>>>>>> 4. Support multiple data sources like HDFS, HBase, Hive
>>>>>> 5. Visualize alerts in real time
>>>>>> 6. Ability to block malicious access in real time
>>>>>> We did not find any data access monitoring solution that available
>>>>>> and can provide the features and functionality that we need to
>>>>>>monitor the
>>>>>> data access in the hadoop ecosystem at our scale. Hence with an
>>>>>> team of world class developers and several users, we have been able
>>>>>> bring Eagle into production as well as open source it.
>>>>>> Rationale
>>>>>> In today's world, data is an important asset for any company.
>>>>>> are using data extensively to create amazing experiences for users.
>>>>>> has to be protected and access to data should be secured from
>>>>>> breaches. Today Hadoop is not only used to store logs but also
>>>>>> financial data, sensitive data sets, geographical data, user click
>>>>>> data sets etc. which makes it more important to be protected from
>>>>>> breaches. To secure a data platform there are multiple things that
>>>>>>need to
>>>>>> happen. One is having a strong access control mechanism which today
>>>>>> provided by Apache Ranger and Apache Sentry. These tools provide
>>>>>> ability to provide fine grain access control mechanism to data sets
>>>>>> hadoop. But there is a big gap in terms of monitoring all the data
>>>>>> events and activities in order to secure the hadoop data platform.
>>>>>> Together with strong access control, perimeter security and data
>>>>>> monitoring in place data in the hadoop clusters can be secured
>>>>>> against breaches. We looked around and found following:
>>>>>> Existing data activity monitoring products are designed for
>>>>>> databases and data warehouse. Existing monitoring platforms cannot
>>>>>> out to support fast growing data and petabyte scale. Few products
>>>>>>in the
>>>>>> industry are still very early in terms of supporting HDFS, Hive,
>>>>>>HBase data
>>>>>> access monitoring.
>>>>>> As mentioned in the background, the business requirement and
>>>>>>urgency to
>>>>>> secure the data from users with malicious intent drove eBay to
>>>>>>invest in
>>>>>> building a real time data access monitoring solution from scratch
>>>>>>to offer
>>>>>> real time alerts and remediation features for malicious data access.
>>>>>> With the power of open source distributed systems like Hadoop,
>>>>>>Kafka and
>>>>>> much more we were able to develop a data activity monitoring system
>>>>>> can scale, identify and stop malicious access in real time.
>>>>>> Eagle allows admins to create standard access policies and rules
>>>>>> monitoring HDFS, Hive and HBase data. Eagle also provides out of
>>>>>> machine learning models for modeling user profiles based on user
>>>>>> behaviour and use the model to alert on anomalies.
>>>>>> Current Status
>>>>>> Meritocracy
>>>>>> Eagle has been deployed in production at eBay for monitoring
>>>>>>billions of
>>>>>> events per day from HDFS and Hive operations. From the start, the
>>>>>> has been built with focus on high scalability and application
>>>>>> in mind and Eagle has demonstrated great performance in responding
>>>>>> suspicious events instantly and great flexibility in defining
>>>>>> Community
>>>>>> Eagle seeks to develop the developer and user communities during
>>>>>> incubation.
>>>>>> Core Developers
>>>>>> Eagle is currently being designed and developed by engineers from
>>>>>> Inc. - Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin
>>>>>> Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All
>>>>>> these core developers have deep expertise in developing monitoring
>>>>>> for the Hadoop ecosystem.
>>>>>> Alignment
>>>>>> The ASF is a natural host for Eagle given that it is already the
>>>>>>home of
>>>>>> Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
>>>>>> projects. Eagle leverages a lot of Apache open-source products. Eagle
>>>>>> designed to offer real time insights into sensitive data access by
>>>>>> monitoring the data access on various data sets in hadoop and an
>>>>>> alerting framework with a powerful policy engine. Eagle complements
>>>>>> existing Hadoop platform area by providing a comprehensive
>>>>>>monitoring and
>>>>>> alerting solution for detecting sensitive data access threats based
>>>>>> preset policies and machine learning models for user behaviour
>>>>>> Known Risks
>>>>>> Orphaned Products
>>>>>> The core developers of Eagle team work full time on this project.
>>>>>> is no risk of Eagle getting orphaned since eBay is extensively
>>>>>>using it in
>>>>>> their production Hadoop clusters and have plans to go beyond
>>>>>>hadoop. For
>>>>>> example, currently there are 7 hadoop clusters and 2 of them are
>>>>>> monitored using Hadoop Eagle in production. We have plans to extend
>>>>>>it to
>>>>>> all hadoop clusters and eventually other data platforms. There are
>>>>>>10's of
>>>>>> policies onboarded and actively monitored with plans to onboard
>>>>>>more use
>>>>>> case. We are very confident that every hadoop cluster in the world
>>>>>>will be
>>>>>> monitored using Eagle for securing the hadoop ecosystem by actively
>>>>>> monitoring for data access on sensitive data. We plan to extend and
>>>>>> diversify this community further through Apache. We presented Eagle
>>>>>>at the
>>>>>> hadoop summit in china and garnered interest from different
>>>>>>companies who
>>>>>> use hadoop extensively.
>>>>>> Inexperience with Open Source
>>>>>> The core developers are all active users and followers of open
>>>>>> They are already committers and contributors to the Eagle Github
>>>>>> All have been involved with the source code that has been released
>>>>>>under an
>>>>>> open source license, and several of them also have experience
>>>>>> code in an open source environment. Though the core set of
>>>>>>Developers do
>>>>>> not have Apache Open Source experience, there are plans to onboard
>>>>>> individuals with Apache open source experience on to the project.
>>>>>> Kylin PMC members are also in the same ebay organization. We work
>>>>>> closely with Apache Ranger committers and are looking forward to
>>>>>> meaningful integrations to improve the security of hadoop platform.
>>>>>> Homogenous Developers
>>>>>> The core developers are from eBay. Today the problem of monitoring
>>>>>> activities to find and stop threats is a universal problem faced
>>>>>>all the
>>>>>> businesses. Apache Incubation process encourages an open and diverse
>>>>>> meritocratic community. Eagle intends to make every possible effort
>>>>>> build a diverse, vibrant and involved community and has already
>>>>>> substantial interest from various organizations.
>>>>>> Reliance on Salaried Developers
>>>>>> eBay invested in Eagle as the monitoring solution for Hadoop
>>>>>>clusters and
>>>>>> some of its key engineers are working full time on the project. In
>>>>>> addition, since there is a growing need for securing sensitive data
>>>>>> we need a data activity monitoring solution for Hadoop, we look
>>>>>>forward to
>>>>>> other Apache developers and researchers to contribute to the
>>>>>> Additional contributors, including Apache committers have plans to
>>>>>> this effort shortly. Also key to addressing the risk associated with
>>>>>> relying on Salaried developers from a single entity is to increase
>>>>>> diversity of the contributors and actively lobby for Domain experts
>>>>>>in the
>>>>>> security space to contribute. Eagle intends to do this.
>>>>>> Relationships with Other Apache Products
>>>>>> Eagle has a strong relationship and dependency with Apache Hadoop,
>>>>>> Spark, Kafka and Storm. Being part of Apache's Incubation
>>>>>>community could
>>>>>> help with a closer collaboration among these projects and as well
>>>>>> others.
>>>>>> An Excessive Fascination with the Apache Brand
>>>>>> Eagle is
>>>>>> to enter incubation at Apache in order to help efforts to diversify
>>>>>> committer-base, not so much to capitalize on the Apache brand. The
>>>>>> project is in production use already inside eBay, but is not
>>>>>>expected to be
>>>>>> an eBay product for external customers. As such, the Eagle project
>>>>>>is not
>>>>>> seeking to use the Apache brand as a marketing tool.
>>>>>> Documentation
>>>>>> Information about Eagle can be found at
>>>>>> The following link provides more information about Eagle
>>>>>> .
>>>>>> Initial Source
>>>>>> Eagle has been under development since 2014 by a team of engineers
>>>>>> eBay Inc. It is currently hosted on under an Apache
>>>>>>license 2.0
>>>>>> at Once in incubation we will be
>>>>>> the code base to apache git library.
>>>>>> External Dependencies
>>>>>> Eagle has the following external dependencies.
>>>>>> Basic
>>>>>> - JDK 1.7+
>>>>>> - Scala 2.10.4
>>>>>> - Apache Maven
>>>>>> - JUnit
>>>>>> - Log4j
>>>>>> - Slf4j
>>>>>> - Apache Commons
>>>>>> - Apache Commons Math3
>>>>>> - Jackson
>>>>>> - Siddhi CEP engine
>>>>>> Hadoop
>>>>>> - Apache Hadoop
>>>>>> - Apache HBase
>>>>>> - Apache Hive
>>>>>> - Apache Zookeeper
>>>>>> - Apache Curator
>>>>>> Apache Spark
>>>>>> - Spark Core Library
>>>>>> REST Service
>>>>>> - Jersey
>>>>>> Query
>>>>>> - Antlr
>>>>>> Stream processing
>>>>>> - Apache Storm
>>>>>> - Apache Kafka
>>>>>> Web
>>>>>> - AngularJS
>>>>>> - jQuery
>>>>>> - Bootstrap V3
>>>>>> - Moment JS
>>>>>> - Admin LTE
>>>>>> - html5shiv
>>>>>> - respond
>>>>>> - Fastclick
>>>>>> - Date Range Picker
>>>>>> - Flot JS
>>>>>> Cryptography
>>>>>> Eagle will eventually support encryption on the wire. This is not
>>>>>>one of
>>>>>> the initial goals, and we do not expect Eagle to be a controlled
>>>>>> item due to the use of encryption. Eagle supports but does not
>>>>>>require the
>>>>>> Kerberos authentication mechanism to access secured Hadoop services.
>>>>>> Required Resources
>>>>>> Mailing List
>>>>>> - eagle-private for private PMC discussions
>>>>>> - eagle-dev for developers
>>>>>> - eagle-commits for all commits
>>>>>> - eagle-users for all eagle users
>>>>>> Subversion Directory
>>>>>> - Git is the preferred source control system.
>>>>>> Issue Tracking
>>>>>> - JIRA Eagle (Eagle)
>>>>>> Other Resources
>>>>>> The existing code already has unit tests so we will make use of
>>>>>> Apache continuous testing infrastructure. The resulting load should
>>>>>>not be
>>>>>> very large.
>>>>>> Initial Committers
>>>>>> - Seshu Adunuthula <sadunuthula at ebay dot com>
>>>>>> - Arun Manoharan <armanoharan at ebay dot com>
>>>>>> - Edward Zhang <yonzhang at ebay dot com>
>>>>>> - Hao Chen <hchen9 at ebay dot com>
>>>>>> - Chaitali Gupta <cgupta at ebay dot com>
>>>>>> - Libin Sun <libsun at ebay dot com>
>>>>>> - Jilin Jiang <jiljiang at ebay dot com>
>>>>>> - Qingwen Zhao <qingwzhao at ebay dot com>
>>>>>> - Hemanth Dendukuri <hdendukuri at ebay dot com>
>>>>>> - Senthil Kumar <senthilkumar at ebay dot com>
>>>>>> - Tan Chen <tanchen at ebay dot com>
>>>>>> Affiliations
>>>>>> The initial committers are employees of eBay Inc.
>>>>>> Sponsors
>>>>>> Champion
>>>>>> - Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>>>>> Nominated Mentors
>>>>>> - Owen O'Malley <omalley at apache dot org> - Apache IPMC
>>>>>> Hortonworks
>>>>>> - Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>>>>> - Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
>>>>>> Hortonworks
>>>>>> Sponsoring Entity
>>>>>> We are requesting the Incubator to sponsor this project.
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> Talend -
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> For additional commands, e-mail:
