hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....
Date Wed, 06 May 2009 10:00:37 GMT
Edward Capriolo wrote:
> 'cloud computing' is a hot term. According to the definition provided
> by wikipedia http://en.wikipedia.org/wiki/Cloud_computing,
> Hadoop+HBase+Lucene+Zookeeper, fits some of the criteria but not well.
> Hadoop is scalable, with HOD it is dynamically scalable.
> I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for
> 'utility computing'. as managing the stack and getting started is
> quite a complex process.

Exactly. Which is why the Apache Clouds proposal emphasises:

-Lightweight front end: low-wattage, stateless nodes for the web GUI, bonded 
to the back end

-Instrumentation for liveness and load monitoring. Hadoop has a lot of 
this and I'm trying to add more, but we want it everywhere.

-Resource Management: bringing up and tearing down nodes by asking the 
infrastructure. Some Apache projects have done this but only for EC2 and 
only for their layer of the stack. You need something that keeps track 
of everything and acts in your interests, not those of the datacentre 

-Packaging for fully automated install/deploy on Linux systems (=rpm and 

-A development process in which the tools push the code out to a 
targeted infrastructure, even for test runs

Hadoop and friends are part of this, they are a very interesting 
foundation, but they are only part of the storing

> Also this stack is best running on LAN network with high speed
> interlinks. Historically the "Cloud" is composed of WAN links. An
> implication of Cloud Computing is that different services would be
> running in different geographical locations which is not how hadoop is
> normally deployed.
> I believe 'Apache Grid Stack' would be a more fitting.
> http://en.wikipedia.org/wiki/Grid_computing
> Grid computing (or the use of computational grids) is the application
> of several computers to a single problem at the same time — usually to
> a scientific or technical problem that requires a great number of
> computer processing cycles or access to large amounts of data.

Classic Grid computing (OGSI/OGSA) is something I want to steer clear 
of. Historically, you end up in WS-* and computer management politics. 
Furthermore, OGSA never had a good use case except "rewrite your apps 
for the cloud and they will be better". They (let's be fair, we) also 
focused too much on CPU scheduling and not enough on storage.

> Grid computing via the Wikipedia definition describes exactly what
> hadoop does. Without amazon S3 and EC2 hadoop does not fit well into a
> 'cloud computing' IMHO

To be precise: without a dynamic infrastructure provider, and that means 
more than just AWS. It could be Sun/Oracle, IBM/Google, HP/Intel/Yahoo!, 
or it could be your ops team and Eucalyptus.

The other hardware/service vendors are working on this infrastructure. 
Apache doesn't work at that level, but if we provide the code to run on 
all of them, we give users independence from any particular 
infrastructure provider.
