hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Using Hadoop in non-typical large scale user-driven environment
Date Wed, 02 Dec 2009 22:25:30 GMT

On Dec 2, 2009, at 4:08 PM, Habermaas, William wrote:

> Hadoop isn't going to like losing its datanodes when people shut down their computers.

Of course, that's what makes it a fun project ;)

Maciej, this is definitely possible, but it is a large project.  My recommendations are:
1) Talk to the Condor folks, who are working on a Hadoop-on-Demand-like system integrated
with Condor.  Condor has a huge number of knobs for things like shutting down jobs when
mouse/keyboard activity is detected.  It also works on Windows.
2) See the new code slated for 0.21.0 that gives you a pluggable framework for data placement.
 This would allow you to pick and choose which hosts your data goes to (as it will have to
go away when people come back).
3) In conjunction with (2), talk to David Anderson's research team at Berkeley.  IIRC, he
had a grad student working on a question along the lines of "in order for a service to have
99% uptime, how many BOINC hosts must it be running on?".  Similarly, you should be able to
get good availability by replicating to enough different hosts (although BOINC was lucky in
that it could run during the "night hours" of any time zone across the world).
4) Security.  I haven't even begun to think about how you'd secure this.
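To make (3) concrete, here is a back-of-the-envelope sketch (my own illustrative numbers, not from the BOINC study): if each volunteer host is independently online with probability p, a block with n replicas is reachable with probability 1 - (1 - p)^n, so you can solve for the replica count that hits a target availability:

```python
# Rough availability model for a block replicated to n volunteer hosts,
# each assumed to be independently online with probability p.

def block_availability(p, n):
    """Probability that at least one of n replicas is online."""
    return 1.0 - (1.0 - p) ** n

def replicas_needed(p, target):
    """Smallest n such that block_availability(p, n) >= target."""
    n = 1
    while block_availability(p, n) < target:
        n += 1
    return n

# Hosts that are online only half the time still reach 99% block
# availability with a modest replica count:
print(replicas_needed(0.5, 0.99))   # -> 7, since 1 - 0.5**7 ~= 0.992
```

The independence assumption is the weak spot, of course: volunteer machines in one time zone tend to go offline together, which is exactly why BOINC's around-the-world spread helped.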

There are lots of challenges and good hard problems to think about.  No guarantee of success.
 I guess that's what makes it a research project.


> More importantly, when the datanodes are running, your users will be impacted by data
replication. Unlike SETI, Hadoop doesn't know when the user's screensaver is running, so it
will start doing things when it feels like it.
> Can someone else comment on whether HOD (Hadoop on Demand) would fit this scenario?
> Bill
> -----Original Message-----
> From: Maciej Trebacz [mailto:maciej.trebacz@gmail.com] 
> Sent: Wednesday, December 02, 2009 4:50 PM
> To: common-user@hadoop.apache.org
> Subject: Using Hadoop in non-typical large scale user-driven environment
> First of all, I'd like to say hi to all the people on the list.
> I ran across the Hadoop and Cloudera projects recently, and I was
> immediately intrigued, because I'm in the middle of writing a
> project that will use large-scale distributed computing for a degree
> at my school. It seems like the perfect tool for me to use, but I have
> some questions to make sure this is the right tool for my needs.
> The project I'm making assumes that there is one master node
> distributing data and several (in theory, hundreds, thousands or
> more) slave nodes. Up to this point, this is exactly what Hadoop is
> for. But now comes the tricky part. I want the slaves to be computers
> that are used by people every day. Think SETI@Home. So the user
> installs a Hadoop client and, ideally, forgets about it, and his
> computer helps to do the computations. Also, the user will not want
> to dedicate much of his hard drive to the computation data.
> The problem with this model, as far as I understand it, is that users
> will often shut down their computers (for whatever reason), once a
> day or even more often. Will that be a big problem for the Hadoop
> server to handle? I mean, I am afraid that most of the processing
> power and bandwidth will be used for controlling traffic in the
> network, and it will not be effective.
> I would appreciate any opinions on this.
> -- 
> Best regards,
> Maciej "mav" Trębacz from Poland.
