hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Environment consideration for a research on scheduling
Date Mon, 26 Sep 2011 09:41:48 GMT
On 23/09/11 16:09, GOEKE, MATTHEW (AG/1000) wrote:
> If you are starting from scratch with no prior Hadoop install experience, I would configure
> stand-alone, migrate to pseudo-distributed, and then to fully distributed, verifying
> functionality at each step with a simple word count run. Also, if you don't mind using the
> CDH distribution, then SCM / their RPMs will greatly simplify both the binary installs and
> the user creation.
>
> Your VM route will most likely work, but I can imagine the number of hiccups during the
> migration from that to the real cluster will not make it worth your time.
>
> Matt
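For reference, the usual pseudo-distributed setup on the 0.20 line is three small overrides in conf/ (these are the values from the standard quickstart; the port numbers are conventional defaults, adjust to taste):

```xml
<!-- conf/core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: single node, so no replication -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

After `bin/hadoop namenode -format` and `bin/start-all.sh`, a word count run against the bundled examples jar is the quick end-to-end check Matt describes.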
> -----Original Message-----
> From: Merto Mertek [mailto:masmertoz@gmail.com]
> Sent: Friday, September 23, 2011 10:00 AM
> To: common-user@hadoop.apache.org
> Subject: Environment consideration for a research on scheduling
> Hi,
>
> In the first phase we are planning to establish a small cluster with a few
> commodity computers (each 1GB RAM, 200GB disk, ...). The cluster would run Ubuntu Server
> 10.10 and a Hadoop build from the 0.20.204 branch (I had some issues with
> version 0.20.203 with missing
> libraries<http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567>).
> Would you suggest any other version?

I wouldn't rush to put Ubuntu 10.x on; it makes a good desktop, but RHEL 
and CentOS are the platforms of choice on the server side.

> In the second phase we are planning to analyse, test and modify some of
> the Hadoop schedulers.

The main schedulers used by Y! and FB are heavily tuned for their 
workloads, and not necessarily something you'd want to play with. There 
is at least one other scheduler in the contrib/ directory to experiment with.
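As an illustration, plugging a different scheduler into the 0.20-era JobTracker is a single property in mapred-site.xml, plus dropping the scheduler's jar from contrib/ onto the JobTracker's classpath. A sketch using the Fair Scheduler (the property name below is the 0.20-era one):

```xml
<!-- conf/mapred-site.xml: swap the JobTracker's task scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

A custom scheduler would go in the same way: subclass TaskScheduler, build the jar, and name your class in that property.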

The other thing about scheduling is that you may have a faster 
development cycle if, instead of working on a real cluster, you simulate 
one at multiples of real time, using stats collected from your own 
workload by way of the gridmix2 tools. I've never done scheduling work, 
but I think there's some tooling there for that. If not, writing it is a 
possible project in itself.
Be aware that the changes in 0.23+ will change resource scheduling; that 
may be a better place to do development, with a plan to deploy in 2012. 
Oh, and get on the mapreduce lists, especially the -dev list, to discuss issues.

> The information contained in this email may be subject to the export control laws and
> regulations of the United States, potentially including but not limited to the Export
> Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department
> of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information
> you are obligated to comply with all applicable U.S. export laws and regulations.

I have no idea what that means, but I am not convinced that reading an 
email forces me to comply with a different country's rules.
