incubator-general mailing list archives

From Siddharth Wagle <swa...@hortonworks.com>
Subject Re: [DISCUSS] Eagle incubator proposal
Date Wed, 21 Oct 2015 16:22:52 GMT
Hi Arun,

This proposal looks great. I would like to be an active contributor on this project. I bring
experience from Apache Ambari and from developing the Ambari Metrics System.

Best Regards,
Sid
________________________________________
From: Julian Hyde <jhyde@apache.org>
Sent: Wednesday, October 21, 2015 9:10 AM
To: general@incubator.apache.org
Subject: Re: [DISCUSS] Eagle incubator proposal

My name is already on the list of mentors. I think this project fills an important need. Several
of the initial committers were involved with Kylin and therefore know the Apache process.

Julian


> On Oct 20, 2015, at 11:58 AM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>
> I should also have some improved bandwidth both now that Kylin is nearing graduation
> and for other reasons. I’ve been bogged down recently, but that’s starting to change.
>
> If more mentors are desired, I’d be willing to help in that respect.
>
> -Taylor
>
>> On Oct 20, 2015, at 11:49 AM, Henry Saputra <henry.saputra@gmail.com> wrote:
>>
>> Hi Ted,
>>
>> Since Kylin is almost ready to graduate, I have more bandwidth to help with Eagle.
>>
>> But you are right that the currently proposed mentors for Eagle seem to
>> be very busy with other podlings, so 1 or 2 additional mentors would
>> be great.
>>
>> The good news is that the team includes some people from Kylin, for
>> example Luke, who has done a great job helping Kylin understand how to
>> work the Apache way.
>> So we have some help from initial committers who have done the rodeo before.
>>
>> - Henry
>>
>> On Mon, Oct 19, 2015 at 9:00 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>> I would suggest that Owen O'Malley has not had enough time to be a viable
>>> mentor recently and should not be on the list of mentors.
>>>
>>> Henry and Julian are good if their schedules permit. Henry, I know, has
>>> been mentoring a number of projects lately.
>>>
>>>
>>>
>>> On Mon, Oct 19, 2015 at 8:40 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Arun,
>>>>
>>>> Very interesting proposal. I may see some possible interaction with
>>>> Falcon. In Falcon, we have HDFS files (and Hive/HBase) monitoring (with a
>>>> kind of Change Data Capture), etc.
>>>>
>>>> So, I see a different perspective in Eagle, but Eagle could also leverage
>>>> Falcon somehow.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>>
>>>> On 10/19/2015 05:33 PM, Manoharan, Arun wrote:
>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> My name is Arun Manoharan. I am currently a product manager on the Analytics
>>>>> platform team at eBay Inc.
>>>>>
>>>>> I would like to start a discussion on Eagle and its joining the ASF as an
>>>>> incubation project.
>>>>>
>>>>> Eagle is a Monitoring solution for Hadoop to instantly identify access to
>>>>> sensitive data, recognize attacks, malicious activities and take actions in
>>>>> real time. Eagle supports a wide variety of policies on HDFS data and Hive.
>>>>> Eagle also provides machine learning models for detecting anomalous user
>>>>> behavior in Hadoop.
>>>>>
>>>>> The proposal is available on the wiki here:
>>>>> https://wiki.apache.org/incubator/EagleProposal
>>>>>
>>>>> The text of the proposal is also available at the end of this email.
>>>>>
>>>>> Thanks for your time and help.
>>>>>
>>>>> Thanks,
>>>>> Arun
>>>>>
>>>>> <COPY of the proposal in text format>
>>>>>
>>>>> Eagle
>>>>>
>>>>> Abstract
>>>>> Eagle is an Open Source Monitoring solution for Hadoop to instantly
>>>>> identify access to sensitive data, recognize attacks, malicious activities
>>>>> in hadoop and take actions.
>>>>>
>>>>> Proposal
>>>>> Eagle audits access to HDFS files, Hive and HBase tables in real time,
>>>>> enforces policies defined on sensitive data access and alerts or blocks
>>>>> user’s access to that sensitive data in real time. Eagle also creates user
>>>>> profiles based on the typical access behaviour for HDFS and Hive and sends
>>>>> alerts when anomalous behaviour is detected. Eagle can also import
>>>>> sensitive data information classified by external classification engines to
>>>>> help define its policies.
>>>>>
>>>>> Overview of Eagle
>>>>> Eagle has 3 main parts.
>>>>> 1. Data collection and storage - Eagle collects data from various hadoop
>>>>> logs in real time using Kafka/Yarn API and uses HDFS and HBase for storage.
>>>>> 2. Data processing and policy engine - Eagle allows users to create
>>>>> policies based on various metadata properties on HDFS, Hive and HBase data.
>>>>> 3. Eagle services - Eagle services include policy manager, query service
>>>>> and the visualization component. Eagle provides intuitive user interface to
>>>>> administer Eagle and an alert dashboard to respond to real time alerts.
>>>>>
>>>>> Data Collection and Storage:
>>>>> Eagle provides a programming API for integrating any data source into
>>>>> the Eagle policy evaluation framework. For example, Eagle HDFS audit
>>>>> monitoring collects data from Kafka, which is populated from a namenode
>>>>> log4j appender or from a logstash agent. Eagle Hive monitoring collects Hive
>>>>> query logs from running jobs through the YARN API, which is designed to be
>>>>> scalable and fault-tolerant. Eagle uses HBase as storage for
>>>>> metadata and metrics data, and also supports a relational database through
>>>>> a configuration change.
>>>>>
>>>>> Data Processing and Policy Engine:
>>>>> Processing Engine: Eagle provides a stream processing API which is an
>>>>> abstraction over Apache Storm. It can also be extended to other streaming
>>>>> engines. This abstraction allows developers to assemble data
>>>>> transformation, filtering, external data joins etc. without being bound
>>>>> to a specific streaming platform. The Eagle streaming API allows developers to
>>>>> easily integrate business logic with the Eagle policy engine; internally, the
>>>>> Eagle framework compiles the business logic execution DAG into program
>>>>> primitives of the underlying stream infrastructure, e.g. Apache Storm. For
>>>>> example, Eagle HDFS monitoring transforms each audit log entry from the Namenode
>>>>> into an object and joins sensitivity metadata and security zone metadata, which
>>>>> are generated by external programs or configured by the user. Eagle Hive
>>>>> monitoring filters running jobs to get the Hive query string, parses the query
>>>>> string into an object and then joins sensitivity metadata.
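
A minimal sketch of the "transform the audit log into an object, then join sensitivity metadata" step described above. It is not taken from the Eagle code base: the sample log line, the key=value layout it assumes, the SENSITIVITY table and both function names are hypothetical.

```python
import re

# Matches key=value pairs; assumes the audit line carries its fields this way.
_PAIR = re.compile(r"(\w+)=(\S+)")

# Hypothetical sensitivity metadata: path prefix -> sensitivity label.
SENSITIVITY = {
    "/data/payments": "FINANCIAL",
    "/data/users": "PII",
}


def parse_audit_line(line):
    """Turn one audit log line into a dict of its key=value fields."""
    return dict(_PAIR.findall(line))


def join_sensitivity(event, metadata=SENSITIVITY):
    """Attach the label of the longest matching path prefix, or None."""
    src = event.get("src", "")
    matches = [p for p in metadata if src == p or src.startswith(p + "/")]
    label = metadata[max(matches, key=len)] if matches else None
    return {**event, "sensitivity": label}


# Sample line in an assumed format, for illustration only.
line = (
    "2015-10-19 09:00:00,000 INFO FSNamesystem.audit: allowed=true "
    "ugi=alice (auth:SIMPLE) ip=/10.0.0.1 cmd=open "
    "src=/data/users/part-0001 dst=null perm=null"
)
event = join_sensitivity(parse_audit_line(line))
```

The enriched event is a plain dict, so a downstream policy can test fields such as `cmd` and `sensitivity` without knowing where the record came from.
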
>>>>> Alerting Framework: The Eagle Alert Framework includes a stream metadata API,
>>>>> a scalable policy engine framework and an extensible policy engine framework.
>>>>> The stream metadata API allows developers to declare an event schema, including
>>>>> which attributes constitute an event, the type of each attribute,
>>>>> and how to dynamically resolve attribute values at runtime when a user
>>>>> configures a policy. The scalable policy engine framework allows policies to be
>>>>> executed on different physical nodes in parallel. It also lets you define
>>>>> your own policy partitioner class. The policy engine framework, together with
>>>>> the stream partitioning capability provided by all streaming platforms, will
>>>>> make sure policies and events can be evaluated in a fully distributed way.
>>>>> The extensible policy engine framework allows developers to plug in a new policy
>>>>> engine with a few lines of code. The WSO2 Siddhi CEP engine is the policy
>>>>> engine which Eagle supports as a first-class citizen.
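
The idea of partitioning policies across nodes can be sketched as follows. This is an illustration only, not Eagle's partitioner or the Siddhi API: the policy list, the hash-based `partition` function and the worker numbering are all assumptions.

```python
from zlib import crc32

# Hypothetical policies: a name plus a predicate over an event dict.
POLICIES = [
    ("pii-delete",
     lambda e: e.get("sensitivity") == "PII" and e.get("cmd") == "delete"),
    ("financial-read",
     lambda e: e.get("sensitivity") == "FINANCIAL" and e.get("cmd") == "open"),
]


def partition(policy_name, num_workers):
    """Assign a policy to one worker using a stable hash (assumed scheme)."""
    return crc32(policy_name.encode("utf-8")) % num_workers


def evaluate(event, worker_id, num_workers, policies=POLICIES):
    """Return names of the policies owned by this worker that the event matches."""
    return [
        name
        for name, predicate in policies
        if partition(name, num_workers) == worker_id and predicate(event)
    ]


def evaluate_all(event, num_workers):
    """Simulate every worker in turn and collect the alerts they raise."""
    alerts = []
    for worker_id in range(num_workers):
        alerts.extend(evaluate(event, worker_id, num_workers))
    return sorted(alerts)
```

Because each policy hashes to exactly one worker, an event that matches a policy raises one alert no matter how many workers run.
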
>>>>> Machine Learning module: Eagle provides capabilities to define user
>>>>> activity patterns or user profiles for Hadoop users based on user
>>>>> behaviour in the platform. These user profiles are modeled using Machine
>>>>> Learning algorithms and used for detection of anomalous user activities.
>>>>> Eagle uses Eigen Value Decomposition and Density Estimation algorithms for
>>>>> generating user profile models. The module reads data from HDFS audit logs,
>>>>> preprocesses and aggregates data, and generates models using Spark
>>>>> programming APIs. Once models are generated, Eagle uses the stream processing
>>>>> engine for near real-time anomaly detection to determine whether any user’s
>>>>> activities are suspicious.
>>>>>
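
As a rough illustration of density-estimation-based profiling, the sketch below fits a Gaussian kernel density estimate to one user's historical activity counts and flags new values that fall in a low-density region. The proposal does not say which estimator Eagle uses, and it builds models with Spark rather than plain Python, so the kernel, bandwidth, threshold and sample data here are all assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D density function estimated from the given samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per historical sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def is_anomalous(density, x, threshold):
    """Flag x when the profile assigns it a density below the threshold."""
    return density(x) < threshold


# Hypothetical history: file reads per hour for one user.
history = [18, 20, 21, 19, 22, 20, 23, 21]
profile = gaussian_kde(history, bandwidth=2.0)
```

With this profile, a value close to the history (about 20 reads per hour) has a high density, while a value far outside it (say 200) has a density near zero and would be flagged.
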
>>>>> Eagle Services:
>>>>> Query Service: Eagle provides a SQL-like service API to support
>>>>> comprehensive computation over huge data sets on the fly, e.g.
>>>>> comprehensive filtering, aggregation, histogram, sorting, top, arithmetical
>>>>> expression, pagination etc. HBase is the data storage which Eagle supports
>>>>> as a first-class citizen; relational databases are supported as well. For HBase
>>>>> storage, the Eagle query framework compiles the user-provided SQL-like query into
>>>>> HBase native filter objects and executes them through an HBase coprocessor on the
>>>>> fly.
>>>>> Policy Manager: Eagle policy manager provides a UI and a RESTful API for users
>>>>> to define policies with just a few clicks. It includes a site management UI,
>>>>> policy editor, sensitivity metadata import, HDFS or Hive sensitive resource
>>>>> browsing, alert dashboards etc.
>>>>>
>>>>> Background
>>>>> Data is one of the most important assets for today’s businesses, which
>>>>> makes data security one of the top priorities of today’s enterprises.
>>>>> Hadoop is widely used across different verticals as a big data repository
>>>>> to store this data in most modern enterprises.
>>>>> At eBay we use the Hadoop platform extensively for our data processing needs.
>>>>> Our data in Hadoop is becoming bigger and bigger as our user base is seeing
>>>>> exponential growth. Today there is a variety of data sets available in
>>>>> Hadoop clusters for our users to consume. eBay has around 120 PB of data
>>>>> stored in HDFS across 6 different clusters and around 1800+ active Hadoop
>>>>> users consuming data through Hive, HBase and MapReduce jobs every day to build
>>>>> applications using this data. With this astronomical growth of data there
>>>>> are also challenges in securing sensitive data and monitoring access to
>>>>> this sensitive data. Today in large organizations HDFS is the de facto
>>>>> standard for storing big data. Data sets including, but not limited to,
>>>>> consumer sentiment, social media data, customer segmentation, web clicks,
>>>>> sensor data, geo-location and transaction data get stored in Hadoop for
>>>>> day-to-day business needs.
>>>>> We at eBay want to make sure the sensitive data and data platforms are
>>>>> completely protected from security breaches. So we partnered very closely
>>>>> with our Information Security team to understand the requirements for Eagle
>>>>> to monitor sensitive data access on Hadoop:
>>>>> 1. Ability to identify and stop security threats in real time
>>>>> 2. Scale for big data (Support PB scale and Billions of events)
>>>>> 3. Ability to create data access policies
>>>>> 4. Support multiple data sources like HDFS, HBase, Hive
>>>>> 5. Visualize alerts in real time
>>>>> 6. Ability to block malicious access in real time
>>>>> We did not find any data access monitoring solution available today
>>>>> that can provide the features and functionality that we need to monitor
>>>>> data access in the Hadoop ecosystem at our scale. Hence, with an excellent
>>>>> team of world class developers and several users, we have been able to
>>>>> bring Eagle into production as well as open source it.
>>>>>
>>>>> Rationale
>>>>> In today’s world, data is an important asset for any company. Businesses
>>>>> are using data extensively to create amazing experiences for users. Data
>>>>> has to be protected and access to data should be secured from security
>>>>> breaches. Today Hadoop is not only used to store logs but also stores
>>>>> financial data, sensitive data sets, geographical data, user click stream
>>>>> data sets etc., which makes it more important to protect it from security
>>>>> breaches. To secure a data platform there are multiple things that need to
>>>>> happen. One is having a strong access control mechanism, which today is
>>>>> provided by Apache Ranger and Apache Sentry. These tools provide
>>>>> fine-grained access control for data sets on
>>>>> Hadoop. But there is a big gap in terms of monitoring all the data access
>>>>> events and activities in order to secure the Hadoop data platform.
>>>>> With strong access control, perimeter security and data access
>>>>> monitoring in place, data in the Hadoop clusters can be secured against
>>>>> breaches. We looked around and found the following:
>>>>>
>>>>> Existing data activity monitoring products are designed for traditional
>>>>> databases and data warehouses. Existing monitoring platforms cannot scale
>>>>> out to support fast growing data and petabyte scale. The few products in the
>>>>> industry are still very early in terms of supporting HDFS, Hive and HBase data
>>>>> access monitoring.
>>>>> As mentioned in the background, the business requirement and urgency to
>>>>> secure the data from users with malicious intent drove eBay to invest in
>>>>> building a real time data access monitoring solution from scratch to offer
>>>>> real time alerts and remediation features for malicious data access.
>>>>> With the power of open source distributed systems like Hadoop, Kafka and
>>>>> much more we were able to develop a data activity monitoring system that
>>>>> can scale, identify and stop malicious access in real time.
>>>>> Eagle allows admins to create standard access policies and rules for
>>>>> monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
>>>>> machine learning models for modeling user profiles based on user access
>>>>> behaviour and uses the models to alert on anomalies.
>>>>>
>>>>> Current Status
>>>>>
>>>>> Meritocracy
>>>>> Eagle has been deployed in production at eBay for monitoring billions of
>>>>> events per day from HDFS and Hive operations. From the start, the product
>>>>> has been built with a focus on high scalability and application extensibility,
>>>>> and Eagle has demonstrated great performance in responding to
>>>>> suspicious events instantly and great flexibility in defining policy.
>>>>>
>>>>> Community
>>>>> Eagle seeks to develop the developer and user communities during
>>>>> incubation.
>>>>>
>>>>> Core Developers
>>>>> Eagle is currently being designed and developed by engineers from eBay
>>>>> Inc. – Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
>>>>> Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
>>>>> these core developers have deep expertise in developing monitoring products
>>>>> for the Hadoop ecosystem.
>>>>>
>>>>> Alignment
>>>>> The ASF is a natural host for Eagle given that it is already the home of
>>>>> Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
>>>>> projects. Eagle leverages a lot of Apache open-source products. Eagle was
>>>>> designed to offer real time insights into sensitive data access through active
>>>>> monitoring of data access on various data sets in Hadoop and through an extensible
>>>>> alerting framework with a powerful policy engine. Eagle complements the
>>>>> existing Hadoop platform area by providing a comprehensive monitoring and
>>>>> alerting solution for detecting sensitive data access threats based on
>>>>> preset policies and machine learning models for user behaviour analysis.
>>>>>
>>>>> Known Risks
>>>>>
>>>>> Orphaned Products
>>>>> The core developers of the Eagle team work full time on this project. There
>>>>> is no risk of Eagle getting orphaned since eBay is extensively using it in
>>>>> its production Hadoop clusters and has plans to go beyond Hadoop. For
>>>>> example, currently there are 7 Hadoop clusters and 2 of them are being
>>>>> monitored using Hadoop Eagle in production. We have plans to extend it to
>>>>> all Hadoop clusters and eventually other data platforms. There are tens of
>>>>> policies onboarded and actively monitored with plans to onboard more use
>>>>> cases. We are very confident that every Hadoop cluster in the world will be
>>>>> monitored using Eagle for securing the Hadoop ecosystem by actively
>>>>> monitoring for data access on sensitive data. We plan to extend and
>>>>> diversify this community further through Apache. We presented Eagle at the
>>>>> Hadoop Summit in China and garnered interest from different companies who
>>>>> use Hadoop extensively.
>>>>>
>>>>> Inexperience with Open Source
>>>>> The core developers are all active users and followers of open source.
>>>>> They are already committers and contributors to the Eagle Github project.
>>>>> All have been involved with the source code that has been released under an
>>>>> open source license, and several of them also have experience developing
>>>>> code in an open source environment. Though the core set of developers do
>>>>> not have Apache open source experience, there are plans to onboard
>>>>> individuals with Apache open source experience on to the project. Apache
>>>>> Kylin PMC members are also in the same eBay organization. We work very
>>>>> closely with Apache Ranger committers and are looking forward to finding
>>>>> meaningful integrations to improve the security of the Hadoop platform.
>>>>>
>>>>> Homogenous Developers
>>>>> The core developers are from eBay. Today the problem of monitoring data
>>>>> activities to find and stop threats is a universal problem faced by all the
>>>>> businesses. Apache Incubation process encourages an open and diverse
>>>>> meritocratic community. Eagle intends to make every possible effort to
>>>>> build a diverse, vibrant and involved community and has already received
>>>>> substantial interest from various organizations.
>>>>>
>>>>> Reliance on Salaried Developers
>>>>> eBay invested in Eagle as the monitoring solution for Hadoop clusters and
>>>>> some of its key engineers are working full time on the project. In
>>>>> addition, since there is a growing need for securing sensitive data access
>>>>> and for a data activity monitoring solution for Hadoop, we look forward to
>>>>> other Apache developers and researchers contributing to the project.
>>>>> Additional contributors, including Apache committers, have plans to join
>>>>> this effort shortly. Also key to addressing the risk associated with
>>>>> relying on salaried developers from a single entity is to increase the
>>>>> diversity of the contributors and actively lobby for domain experts in the
>>>>> security space to contribute. Eagle intends to do this.
>>>>>
>>>>> Relationships with Other Apache Products
>>>>> Eagle has a strong relationship and dependency with Apache Hadoop, HBase,
>>>>> Spark, Kafka and Storm. Being part of Apache’s Incubation community could
>>>>> help with closer collaboration among these projects as well as
>>>>> others.
>>>>>
>>>>> An Excessive Fascination with the Apache Brand
>>>>> Eagle is proposing
>>>>> to enter incubation at Apache in order to help efforts to diversify the
>>>>> committer-base, not so much to capitalize on the Apache brand. The Eagle
>>>>> project is in production use already inside eBay, but is not expected to be
>>>>> an eBay product for external customers. As such, the Eagle project is not
>>>>> seeking to use the Apache brand as a marketing tool.
>>>>>
>>>>> Documentation
>>>>> Information about Eagle can be found at https://github.com/eBay/Eagle.
>>>>> The following link provides more information about Eagle: http://goeagle.io
>>>>>
>>>>> Initial Source
>>>>> Eagle has been under development since 2014 by a team of engineers at
>>>>> eBay Inc. It is currently hosted on Github.com under the Apache License 2.0
>>>>> at https://github.com/eBay/Eagle. Once in incubation we will be moving
>>>>> the code base to an Apache git repository.
>>>>>
>>>>> External Dependencies
>>>>> Eagle has the following external dependencies.
>>>>> Basic
>>>>> •JDK 1.7+
>>>>> •Scala 2.10.4
>>>>> •Apache Maven
>>>>> •JUnit
>>>>> •Log4j
>>>>> •Slf4j
>>>>> •Apache Commons
>>>>> •Apache Commons Math3
>>>>> •Jackson
>>>>> •Siddhi CEP engine
>>>>>
>>>>> Hadoop
>>>>> •Apache Hadoop
>>>>> •Apache HBase
>>>>> •Apache Hive
>>>>> •Apache Zookeeper
>>>>> •Apache Curator
>>>>>
>>>>> Apache Spark
>>>>> •Spark Core Library
>>>>>
>>>>> REST Service
>>>>> •Jersey
>>>>>
>>>>> Query
>>>>> •Antlr
>>>>>
>>>>> Stream processing
>>>>> •Apache Storm
>>>>> •Apache Kafka
>>>>>
>>>>> Web
>>>>> •AngularJS
>>>>> •jQuery
>>>>> •Bootstrap V3
>>>>> •Moment JS
>>>>> •Admin LTE
>>>>> •html5shiv
>>>>> •respond
>>>>> •Fastclick
>>>>> •Date Range Picker
>>>>> •Flot JS
>>>>>
>>>>> Cryptography
>>>>> Eagle will eventually support encryption on the wire. This is not one of
>>>>> the initial goals, and we do not expect Eagle to be a controlled export
>>>>> item due to the use of encryption. Eagle supports but does not require the
>>>>> Kerberos authentication mechanism to access secured Hadoop services.
>>>>>
>>>>> Required Resources
>>>>>
>>>>> Mailing List
>>>>> •eagle-private for private PMC discussions
>>>>> •eagle-dev for developers
>>>>> •eagle-commits for all commits
>>>>> •eagle-users for all eagle users
>>>>>
>>>>> Subversion Directory
>>>>> •Git is the preferred source control system.
>>>>>
>>>>> Issue Tracking
>>>>> •JIRA Eagle (Eagle)
>>>>>
>>>>> Other Resources
>>>>> The existing code already has unit tests so we will make use of existing
>>>>> Apache continuous testing infrastructure. The resulting load should not be
>>>>> very large.
>>>>>
>>>>> Initial Committers
>>>>> •Seshu Adunuthula <sadunuthula at ebay dot com>
>>>>> •Arun Manoharan <armanoharan at ebay dot com>
>>>>> •Edward Zhang <yonzhang at ebay dot com>
>>>>> •Hao Chen <hchen9 at ebay dot com>
>>>>> •Chaitali Gupta <cgupta at ebay dot com>
>>>>> •Libin Sun <libsun at ebay dot com>
>>>>> •Jilin Jiang <jiljiang at ebay dot com>
>>>>> •Qingwen Zhao <qingwzhao at ebay dot com>
>>>>> •Hemanth Dendukuri <hdendukuri at ebay dot com>
>>>>> •Senthil Kumar <senthilkumar at ebay dot com>
>>>>> •Tan Chen <tanchen at ebay dot com>
>>>>>
>>>>> Affiliations
>>>>> The initial committers are employees of eBay Inc.
>>>>>
>>>>> Sponsors
>>>>>
>>>>> Champion
>>>>> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>>>>
>>>>> Nominated Mentors
>>>>> •Owen O’Malley <omalley at apache dot org> - Apache IPMC member,
>>>>> Hortonworks
>>>>> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>>>> •Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
>>>>> Hortonworks
>>>>>
>>>>> Sponsoring Entity
>>>>> We are requesting the Incubator to sponsor this project.
>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbonofre@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>
>>>>