hadoop-common-user mailing list archives

From "David B. Ritch" <david.ri...@gmail.com>
Subject Re: editing etc hosts files of a cluster
Date Tue, 20 Oct 2009 03:09:14 GMT
Most of the communication and name lookups within a cluster refer to
other nodes within that same cluster.  It is usually not a big deal to
put all the systems from a cluster in a single hosts file, and rsync it
around the cluster.  (Consider using prsync, which comes with pssh,
http://www.theether.org/pssh/, or your favorite cluster management
tool.)  Editing each file individually clearly doesn't scale, but editing
it once and replicating it does.
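
As a rough illustration of the "edit once, replicate" idea, here is a minimal Python sketch that renders a single hosts file from one list of nodes. The node names, addresses, and the "cluster.example" domain are hypothetical placeholders, not anything from a real cluster; the generated text would then be written out once and pushed around with rsync, prsync, or whatever tool you prefer.

```python
def render_hosts(nodes, domain="cluster.example"):
    """Return /etc/hosts-style text for a dict of short hostname -> IPv4 address.

    The domain is a hypothetical placeholder; substitute your own.
    """
    lines = ["127.0.0.1 localhost"]
    # Sort by hostname so the generated file is stable between runs.
    for name in sorted(nodes):
        lines.append(f"{nodes[name]} {name}.{domain} {name}")
    return "\n".join(lines) + "\n"


# Hypothetical nodes, for illustration only.
nodes = {"node002": "10.0.0.12", "node001": "10.0.0.11"}
hosts_text = render_hosts(nodes)
print(hosts_text, end="")
```

Keeping the node list in one place means the hosts file is regenerated rather than hand-edited, so every node receives an identical copy.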

Is a large hosts file less efficient than nscd or a caching DNS server
for nodes within the cluster?
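
One way to get a feel for the answer is simply to measure it on a node. The following is a small sketch, not a rigorous benchmark: it times repeated lookups through the system resolver using Python's socket.gethostbyname, so whatever is configured on the node (hosts file, nscd, or a caching DNS server) is what gets exercised. The function name and the repeat count are my own choices for illustration.

```python
import socket
import time


def mean_lookup_seconds(hostname, repeats=100, resolve=socket.gethostbyname):
    """Return the mean wall-clock time of one lookup of hostname.

    resolve is the lookup function to time; it defaults to the system
    resolver, so the result reflects the node's own configuration.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        resolve(hostname)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # "localhost" is used only so the sketch runs anywhere;
    # substitute the name of another node in your cluster.
    print(f"{mean_lookup_seconds('localhost') * 1e6:.1f} microseconds per lookup")
```

Running it on the same node under each configuration, with a real cluster hostname, would give a crude comparison.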



On 10/19/2009 8:02 PM, Edward Capriolo wrote:
> On Mon, Oct 19, 2009 at 7:17 PM, Allen Wittenauer
> <awittenauer@linkedin.com> wrote:
>> On 10/19/09 11:46 AM, "Edward Capriolo" <edlinuxguru@gmail.com> wrote:
>>> I am interested in your post. What has caused you to run caching DNS
>>> servers on each of your nodes? Is this a hadoop specific problem or a
>>> problem  specific to your implementation?
>> Hadoop does a -tremendous- amount of hostname lookups.  If you don't have
>> either nscd or a local DNS caching server, you are likely throwing what
>> could be some significant performance gains away.
>>> My assumption here is that a hadoop cluster of say 1000 nodes would
>>> repeatedly talk to the same 1000 nodes.
>> ... and that's the catch!  Every node running the DFSClient code or being
>> called out from a map/reduce task is a potential hostname that would need
>> to be resolved.  Just think about something like distcp.
>> Also note that this is before we talk about monitoring, any other naming
>> services, CNAMEs, multi-As, etc, that get built as a normal part of running
>> an infrastructure.
>>> Are you saying that nscd is
>>> inadequate to handle the size of the cache, or nscd is not very
>>> efficient? What exactly is the reason you are running a caching DNS
>>> server on each node?
>> In the case of Yahoo!, we had (or at least perceived that we had) or
>> were going to have jobs that did a lot of direct DNS lookups and/or
>> accessed/referenced things outside of the local grid.  Also note that a DNS
>> caching server is going to store more information about hostnames than a
>> simple host-to-IP service like nscd.
>> Hypothetical:  Let's say I'm building rules for a spam filter and part of my
>> process is to look up the MX record for a given host.  nscd isn't going to
>> help you there.
>> In the case of LinkedIn, the jury is still out.  I suspect we don't have
>> nscd.conf tuned correctly.  Our grid isn't that big, our connections in/out
>> are fairly small, etc. It has been one of the things on my todo list since I
>> got hired here 2 months ago. :)
>> [For the record, I'm not one of those crazy people who turn off nscd
>> because I had a bad experience with a broken version five years ago.  In
>> the case of Yahoo!, I was the crazy person who started insisting we turn it
>> on, albeit not for hosts.]
> Cool thanks for the info.
> I have found NSCD to be absolutely essential in most/all situations.
> Whenever I would truss processes on OSes without NSCD (say FreeBSD
> 6.2) I would see numerous repeated 'stat' calls against /etc/passwd and
> /etc/group.
> If you are doing users and groups through LDAP, nscd is super important
> as well. You're not going to want to make a series of lookups on each stat.
> I would think the most efficient implementation would be nscd and a
> local caching server in that case. NSCD should be very efficient since
> it is done through libraries, whereas DNS lookups have to open sockets
> (overhead). However, I can see your point: nscd cannot handle other types
> of records.
