hadoop-common-user mailing list archives

From Ricky Ho <...@adobe.com>
Subject RE: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....
Date Tue, 05 May 2009 17:46:16 GMT
The slide deck discusses possibly bundling various existing Apache distributed-systems technologies, as well as Java APIs for accessing Amazon cloud services.

What hasn't been discussed is the difference between a "traditional distributed architecture"
and "the cloud".  They are "close" but not close enough to be treated the "same".  In my opinion,
some of the distributed technologies in Apache need to be enhanced in order to fit into the
cloud more effectively.

Let me focus on some cloud characteristics that our existing Apache distributed technologies
haven't been paying attention to: extreme elasticity, trust boundaries, and cost awareness.

Extreme elasticity
Most distributed technologies treat machine shutdown/startup as a relatively infrequent operation
and haven't tried hard to minimize the cost of handling these situations.  Take Hadoop as
an example: although it handles machine crashes gracefully, it doesn't handle the cloud-bursting
scenario well (i.e., when many machines are added to the Hadoop cluster at once).  You need to run a
data-redistribution task in the background, which slows down your existing jobs.
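One technique that keeps redistribution cheap when machines join is consistent hashing, where only the keys between a new node and its ring predecessor move. This is an illustrative sketch of that idea, not HDFS's actual block-placement code; the class and node names are made up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Minimal consistent-hash ring: adding a node remaps only roughly
 * 1/N of the blocks instead of triggering a cluster-wide reshuffle.
 * Illustrative sketch only, not Hadoop's placement policy.
 */
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private static final int VNODES = 64; // virtual nodes per machine, to smooth the load

    public void addNode(String node) {
        for (int i = 0; i < VNODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < VNODES; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** Node responsible for a block: first ring entry at or after the block's hash. */
    public String nodeFor(String blockId) {
        long h = hash(blockId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

With four nodes, adding a fifth moves only about a fifth of the blocks, which is exactly the property a bursting cluster wants.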

Another example is that many of Hadoop's scripts rely on config files that list each cluster
member's IP address.  In a cloud environment, IP addresses are unstable, so we need a
discovery mechanism and also need to rework the scripts.
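The shape of such a discovery mechanism is a registry that workers heartbeat into under a logical name, with the master reading the live set rather than a static slaves file. In practice this role is usually played by ZooKeeper ephemeral nodes; the in-memory toy below (all names invented) only shows the idea:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy membership registry: workers heartbeat under a stable logical
 * ID even though their IP address may change; the master asks for
 * the currently-live set. A stand-in for ZooKeeper ephemeral nodes.
 */
public class MembershipRegistry {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public MembershipRegistry(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** A worker calls this periodically; its address may differ between calls. */
    public void heartbeat(String workerId) {
        lastSeen.put(workerId, System.currentTimeMillis());
    }

    /** Live members: anyone who has heartbeated within the TTL. */
    public Set<String> liveMembers() {
        long now = System.currentTimeMillis();
        Set<String> live = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (now - e.getValue() <= ttlMillis) live.add(e.getKey());
        }
        return live;
    }
}
```

A worker that restarts on a new IP simply heartbeats under the same ID, and no script needs editing.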

Trust boundary
Most distributed technologies assume a homogeneous environment (every member has the
same degree of trust), which is not the case in the cloud environment.  Additional processing
(cryptographic operations for data transfer and storage) may be necessary when dealing with
machines running in the cloud.
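A minimal sketch of that extra processing, using only the JDK's own javax.crypto: encrypt a block on the trusted side before it crosses the boundary into cloud storage. Key management (who holds the key, how it rotates) is the genuinely hard part and is deliberately out of scope here; the class name is invented:

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/**
 * Client-side AES-GCM encryption for data that crosses the trust
 * boundary into the cloud. Sketch only: real deployments need key
 * management, rotation, and integrity policy decisions on top.
 */
public class BoundaryCrypto {
    private static final SecureRandom RNG = new SecureRandom();

    public static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        return kg.generateKey();
    }

    /** Returns a 12-byte IV followed by ciphertext plus auth tag. */
    public static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        byte[] iv = new byte[12];
        RNG.nextBytes(iv); // fresh IV per block; never reuse with the same key
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(plain);
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return out;
    }

    public static byte[] decrypt(SecretKey key, byte[] blob) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, blob, 0, 12));
        byte[] ct = new byte[blob.length - 12];
        System.arraycopy(blob, 12, ct, 0, ct.length);
        return c.doFinal(ct);
    }
}
```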

Cost awareness
Because of that same homogeneity assumption, the scheduler is not aware of
the cost involved when it moves data across the cloud boundary (where bandwidth, in particular,
is relatively expensive).  The Hadoop MapReduce scheduler needs to be more sophisticated when deciding
where to start Mappers and Reducers.  Similarly, when making replica-placement decisions,
HDFS needs to be aware of which machine is located in which cloud.
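One way to picture such cloud-aware placement: keep most replicas inside the writer's cloud where bandwidth is cheap, but force one replica across the boundary for durability. This is loosely analogous to HDFS's rack-aware placement with "cloud" standing in for "rack"; it is a toy sketch with invented names, not the real BlockPlacementPolicy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Toy cloud-aware replica placement: first replica on the writer,
 * one replica in a different cloud for durability, the rest kept
 * local to avoid paying cross-boundary bandwidth for every write.
 */
public class CloudAwarePlacement {
    /** nodeToCloud maps each node name to its cloud/zone label. */
    public static List<String> chooseReplicas(
            String writerNode, Map<String, String> nodeToCloud, int replicas) {
        String home = nodeToCloud.get(writerNode);
        List<String> local = new ArrayList<>();
        List<String> remote = new ArrayList<>();
        for (Map.Entry<String, String> e : nodeToCloud.entrySet()) {
            if (e.getValue().equals(home)) local.add(e.getKey());
            else remote.add(e.getKey());
        }
        List<String> chosen = new ArrayList<>();
        chosen.add(writerNode); // first replica on the writer, as HDFS does
        if (!remote.isEmpty() && replicas > 1) {
            chosen.add(remote.get(0)); // exactly one off-cloud copy for durability
        }
        for (String n : local) { // fill the rest from the cheap local cloud
            if (chosen.size() >= replicas) break;
            if (!chosen.contains(n)) chosen.add(n);
        }
        return chosen;
    }
}
```

A cost-aware MapReduce scheduler would apply the same labels in reverse, preferring to launch a Mapper in whichever cloud already holds the block.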

That said, I am not discounting the existing Apache technologies.  In fact, we have already
taken a good first step.  We just need to go further.


-----Original Message-----
From: Bradford Stephens [mailto:bradfordstephens@gmail.com] 
Sent: Tuesday, May 05, 2009 9:53 AM
To: core-user@hadoop.apache.org
Subject: Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

I read through the deck and sent it around the company. Good stuff!
It's going to be a big help for trying to get the .NET Enterprise
people wrapping their heads around web-scale data.

I must admit "Apache Cloud Computing Edition" is sort of unwieldy to
say verbally, and frankly "Java Enterprise Edition" is a taboo phrase
at a lot of projects I've had. Guilt by association. I think I'll call
it "Apache Cloud Stack", and reference "Apache Cloud Computing
Edition" in my deck. When I think "Stack", I think of a suite of
software that provides all the pieces I need to solve my problem :)

On Tue, May 5, 2009 at 7:00 AM, Steve Loughran <stevel@apache.org> wrote:
> Bradford Stephens wrote:
>> Hey all,
>> I'm going to be speaking at OSCON about my company's experiences with
>> Hadoop and Friends, but I'm having a hard time coming up with a name
>> for the entire software ecosystem. I'm thinking of calling it the
>> "Apache CloudStack". Does this sound legit to you all? :) Is there
>> something more 'official'?
> We've been using "Apache Cloud Computing Edition" for this, to emphasise
> this is the successor to Java Enterprise Edition, and that it is cross
> language and being built at apache. If you use the same term, even if you
> put a different stack outline than us, it gives the idea more legitimacy.
> The slides that Andrew linked to are all in SVN under
> http://svn.apache.org/repos/asf/labs/clouds/
> we have a space in the apache labs for "apache clouds", where we want to do
> more work integrating things, and bringing the idea of deploy and test on
> someone else's infrastructure mainstream across all the apache products. We
> would welcome your involvement -and if you send a draft of your slides out,
> will happily review them
> -steve
