From Steve Loughran <ste...@apache.org>
Subject Re: Hadoop on windows with bat and ant scripts
Date Tue, 14 Jun 2011 10:59:11 GMT
On 13/06/11 15:27, Bible, Landy wrote:
> On 06/13/2011 07:52 AM, Loughran, Steve wrote:
>>> On 06/10/2011 03:23 PM, Bible, Landy wrote:
>>> I'm currently running HDFS on Windows 7 desktops.  I had to create a hadoop.bat
>>> that provided the same functionality of the shell scripts, and some Java Service Wrapper configs
>>> to run the DataNodes and NameNode as windows services.  Once I get my system more functional
>>> I plan to do a write up about how I did it, but it wasn't too difficult.  I'd also like to
>>> see Hadoop become less platform dependent.
>> why? Do you plan to bring up a real Windows server datacenter to test it on?
> Not a datacenter, but a large-ish cluster of desktops, yes.
>> Whether you like it or not, all the big Hadoop clusters run on Linux
> I realize that, I use Linux wherever possible, much to the annoyance of my Windows only
> co-workers. However, for my current project, I'm using all the Windows 7 and Vista desktops
> at my site as a storage cluster.   The first idea was to run Hadoop on Linux in a VM in the
> background on each desktop, but that seemed like overkill.  The point here is to use the resources
> we have but aren't using, rather than buy new resources.  Academia is funny like that.

I understand. One trick my local university has done is to buy a set of 
servers with HDDs for their HDFS filestore, but also hook them up to 
their grid scheduler (Condor? Torque?) so the existing grid jobs see a 
set of machines for their work, while the JobTracker sees a farm of 
worker nodes with local data. Some more work there on reporting 
busy-state to each job scheduler would be nice, so that the 
TaskTrackers would say "busy" when running grid jobs, and vice versa.

>>> So far, I've been unable to make MapReduce work correctly.  The services run,
>>> but things don't work, however I suspect that this is due to DNS not working correctly in
>>> my environment.
>> yes, that's part of the anywhere you have to fix. Edit the host tables so that DNS
>> and reverse DNS appears to work. That's c:\windows\system32\drivers\etc\hosts, unless on a
>> win64 box it moves.
> Why does Hadoop even care about DNS?   Every node checks in with the NameNode and JobTrackers,
> so they know where they are, why not just go pure IP based and forget DNS.   Managing the
> hosts file is a pain... even when you automate it, it just seems unneeded.

There have been some fixes in 0.21 and 0.22, but there may still be a 
tendency to look hostnames up.


Hadoop doesn't like coming up on multi-homed servers or having separate 
in-cluster and long-haul hostnames. Yes, this all needs fixing. I think 
the reason it hasn't been fixed is that the big datacentres do have well 
configured networks, caching DNS servers in every worker node, etc, and 
all is well. It's the home networks and the less-consistently set up 
ones (mine, and perhaps yours) where the trouble shows up. We get to 
file the bugs and fix the problems.
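
If you want to see whether forward and reverse lookups agree on a given 
box before blaming Hadoop, something like the following can help. It is 
only a rough sketch using the standard java.net.InetAddress calls, not 
anything taken from the Hadoop codebase; the class name DnsCheck and its 
helper methods are made up for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: checks that a hostname resolves
// to an address whose reverse lookup gives the same name back.
public class DnsCheck {

    // Normalise a hostname for comparison: lower-case, no trailing dot.
    static String normalise(String name) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        return n.endsWith(".") ? n.substring(0, n.length() - 1) : n;
    }

    // True when two hostnames are the same after normalisation.
    static boolean sameHost(String a, String b) {
        return normalise(a).equals(normalise(b));
    }

    // Forward lookup, then reverse lookup of the resulting address.
    // getCanonicalHostName() falls back to the textual IP address when the
    // reverse lookup fails, which is reported here as a mismatch.
    static boolean roundTrips(String hostname) throws UnknownHostException {
        InetAddress addr = InetAddress.getByName(hostname);
        String reverse = addr.getCanonicalHostName();
        System.out.println(hostname + " -> " + addr.getHostAddress() + " -> " + reverse);
        return sameHost(hostname, reverse);
    }

    public static void main(String[] args) throws UnknownHostException {
        // Default to whatever the local machine thinks it is called.
        String host = args.length > 0
                ? args[0]
                : InetAddress.getLocalHost().getHostName();
        System.out.println(roundTrips(host) ? "consistent" : "MISMATCH");
    }
}
```

Run it on each node, passing the names of the other nodes as the 
argument. Note that a short name whose reverse lookup returns the fully 
qualified name will show up as a mismatch, which may or may not matter 
for your setup.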

