incubator-general mailing list archives

From "Chen, Pei" <Pei.C...@childrens.harvard.edu>
Subject RE: [PROPOSAL] Blur for the Apache Incubator
Date Wed, 18 Jul 2012 19:00:01 GMT
This seems like a very interesting project.
Looking forward to seeing it in Apache...

-----Original Message-----
From: Aaron McCurry [mailto:amccurry@gmail.com] 
Sent: Friday, July 13, 2012 5:24 PM
To: general@incubator.apache.org
Subject: [PROPOSAL] Blur for the Apache Incubator

Hello!

I would like to propose Blur to be an Apache Incubator project.  Blur is a distributed search
platform built for low latency searches over large amounts of data.  Blur is scalable and
fault tolerant through the use of Hadoop and ZooKeeper.  Thrift is used as the RPC library
and the underlying search implementation uses Lucene and the Lucene query syntax.

The proposal can be found here:
http://wiki.apache.org/incubator/BlurProposal

I have included the contents of the proposal below.

Thanks!
Aaron

= Blur Proposal =

== Abstract ==
Blur is a search platform capable of searching massive amounts of data in a cloud computing
environment. Blur leverages several existing Apache projects, including Apache Lucene, Apache
Hadoop, Apache !ZooKeeper and Apache Thrift.  Both bulk and near real time (NRT) updates are
possible with Blur.  Bulk updates are accomplished using Hadoop Map/Reduce, and NRT updates are
performed through direct Thrift calls.

== Proposal ==
Blur is an open source search platform capable of querying massive amounts of data at incredible
speeds. Rather than using the flat, document-like data model used by most search solutions,
Blur allows you to build rich data models and search them in a semi-relational manner, similar
to using joins when querying a relational database. Using Blur, you can get precise search results
against terabytes of data at Google-like speeds.  Blur leverages multiple open source projects
including Hadoop, Lucene, Thrift and !ZooKeeper to create an environment where structured
data can be transformed into an index that runs on a Hadoop cluster.  Blur uses the power
of Map/Reduce for bulk indexing into Blur.  Server failures are handled automatically by using
!ZooKeeper for cluster state and HDFS for index storage.

== Background ==
Blur was created by Aaron !McCurry in 2010. Blur was developed to address the challenge of
searching quantities of data too large for traditional RDBMS solutions to cope with, while
still providing JOIN-like capabilities to query the data.  Several other
open source projects have implemented aspects of this design including elasticsearch, Katta
and Apache Solr.

== Rationale ==
There is a need for a distributed search capability within the Hadoop ecosystem. Currently,
there are no other search solutions that natively leverage HDFS and the failover features
of Hadoop in the same manner as the Blur project. The communities we expect to be most interested
in such a project are government, health care, and other industries where scalability is a
concern. We have made much progress in developing this project over the past 2 years and believe
both the project and the interested communities would benefit from this work being openly
available and having open development.  In future versions of Blur the API will more closely
follow the APIs provided in Lucene so that systems that already use Lucene can more easily
scale with Blur. Blur can be viewed as a query execution engine that Lucene-based solutions
can utilize when scale becomes an issue.

== Initial Goals ==
The initial goals of the project are:
 * To migrate the Blur codebase, issue tracking and wiki from github.com and integrate the
project with the ASF infrastructure.
 * To add new committers to the project and grow the community in "The Apache Way".

== Current Status ==

=== Meritocracy ===
Blur was initially developed by Aaron !McCurry in June 2010.  Since then Blur has continued
to evolve with the support of a small development team at Near Infinity.  As a part of the
Apache Software Foundation, the Apache Blur team intends to strongly encourage the community
to help with and contribute to the project.  Apache Blur will actively seek potential committers
and help them become familiar with the codebase.

=== Community ===
A small community has developed around Blur and several project teams are currently using
Blur for their big data search capability. The source code is currently available on GitHub
and there is a dedicated website (blur.io) that provides an overview of the project. Blur
has been shared with several members of the Apache community and has been presented at the
Bay Area HUG (see http://www.meetup.com/hadoop/events/20109471/).

=== Core Developers ===
The current developers are employed by Near Infinity Corporation, but we anticipate interest
developing among other companies.

=== Alignment ===
Blur is built on top of a number of Apache projects: Hadoop, Lucene, !ZooKeeper, and Thrift.
It builds with Maven.  During the course of Blur development, a couple of patches have been
committed back to the Lucene project, including LUCENE-2205 and LUCENE-2215.  Due to the strong
relationship with the aforementioned Apache projects, the incubator is a good match for
Blur.

== Known Risks ==

=== Orphaned Products ===
There is only a small risk of being orphaned. The customers that currently use Blur are committed
to improving the codebase of the project because it fulfills needs not addressed by any
other software. In addition, one customer is providing financial support to further develop
Blur given its importance to mission-critical projects.

=== Inexperience with Open Source ===
The codebase has been treated internally as an open source project since its beginning, and
Near Infinity has extensive experience developing and releasing open source projects (http://www.nearinfinity.com/products/open_source).
We do not anticipate difficulty in operating under the Apache Way.

=== Homogeneous Developers ===
Current developers are all employed by Near Infinity but we are actively seeking contributors
from different companies and would welcome their participation.

=== Reliance on Salaried Developers ===
Blur was originally created by Aaron !McCurry as a personal project and he remains the primary
contributor.  Currently, Aaron's employer (Near Infinity) fully supports his continued participation
with paid, dedicated time to work on Blur. All other current developers are paid by Near Infinity
to work on Blur as well.

=== Relationships with Other Apache Products ===
Blur dependencies:

 * Apache Hadoop
 * Apache Lucene
 * Apache !ZooKeeper
 * Apache Thrift
 * Apache log4j

=== Apache Brand ===
Our interest in releasing this code as an Apache project is due to its strong relationship
with other Apache projects (Blur has dependencies on Hadoop, Lucene, !ZooKeeper, and
Thrift) and its uniqueness within the Hadoop ecosystem.

== Documentation ==
Current documentation can be found at http://blur.io and https://github.com/nearinfinity/blur.

== Initial Source ==
Blur has been in development since summer 2010. The core codebase consists of about 29,000
lines of code, mainly Java (about 10,000 if the generated RPC code is not included).

== Source and Intellectual Property Submission Plan ==
Blur core code, examples, documentation, and training materials will be submitted by Near
Infinity Corporation.

== External Dependencies ==
 * concurrentlinkedhashmap - Apache 2.0 License - http://code.google.com/p/concurrentlinkedhashmap/

== Cryptography ==
none

== Required Resources ==
 * Mailing Lists
   * blur-private
   * blur-dev
   * blur-commits
   * blur-user
 * Git Repository
   * https://git-wip-us.apache.org/repos/asf/blur.git
 * Issue Tracking
   * JIRA
 * Continuous Integration
   * Jenkins
 * Web
   * http://incubator.apache.org/blur/
   * wiki at http://wiki.apache.org or http://cwiki.apache.org

== Initial Committers ==
 * Aaron !McCurry (aaron.mccurry at nearinfinity dot com)
 * Scott Leberknight (scott.leberknight at nearinfinity dot com)
 * Ryan Gimmy (ryan.gimmy at nearinfinity dot com)
 * Tim Williams (twilliams at apache dot org)
 * Patrick Hunt (phunt at apache dot org)
 * Doug Cutting (cutting at apache dot org)

== Affiliations ==
 * Aaron !McCurry, Near Infinity
 * Scott Leberknight, Near Infinity
 * Ryan Gimmy, Near Infinity
 * Patrick Hunt, Cloudera
 * Doug Cutting, Cloudera

== Sponsors ==
 * Champion: Patrick Hunt

== Nominated Mentors ==
 * Tim Williams (twilliams at apache dot org)
 * Doug Cutting (cutting at apache dot org)
 * Patrick Hunt (phunt at apache dot org)

== Sponsoring Entity ==
 * Apache Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

