Date: Sun, 22 Jun 2014 19:33:24 +0000 (UTC)
From: "Paulo Motta (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-7431) Hadoop integration does not perform reverse DNS lookup correctly on EC2

    [ https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040220#comment-14040220 ]

Paulo Motta commented on CASSANDRA-7431:
----------------------------------------

I guess Hadoop's InputSplit.getLocations() method (implemented by ColumnFamilySplit) expects a list of hostnames to be able to schedule local tasks, since task trackers are identified by hostnames, not IPs.
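To make the matching concern concrete, here is a toy model of it. It is not Hadoop's scheduler code: it simply assumes that a split counts as local when one of its location strings equals the task tracker's hostname. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT Hadoop's scheduler. It assumes trackers are matched to
// split locations by plain (case-insensitive) comparison of host strings.
class LocalitySketch {
    // Returns true if any split location names the given task tracker host.
    static boolean isLocal(String trackerHost, List<String> splitLocations) {
        for (String location : splitLocations) {
            if (location.equalsIgnoreCase(trackerHost)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String tracker = "ec2-54-201-254-99.compute-1.amazonaws.com";
        // Split location reported as an IP versus as the EC2 public DNS name.
        List<String> byIp = Arrays.asList("54.201.254.99");
        List<String> byName = Arrays.asList("ec2-54-201-254-99.compute-1.amazonaws.com");
        System.out.println("location given as IP:       " + isLocal(tracker, byIp));
        System.out.println("location given as hostname: " + isLocal(tracker, byName));
    }
}
```

Under that assumption, a split whose location is the raw IP never matches a tracker registered under its EC2 public DNS name, even though both refer to the same machine.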
Using only private IPs in Hadoop is not feasible because you may want to access the task tracker web interfaces from outside EC2, so it's handy to use EC2 public DNS names (ec2-*.compute-1.amazonaws.com) to identify Hadoop trackers, since these names resolve to private IPs inside EC2 and to public IPs outside it.

Another issue when the C* cluster uses public IPs as broadcast_address (such as with the EC2MultiRegionSnitch) is that Hadoop tasks will access ColumnFamilySplits of non-local tasks via the public IP, which costs $0.01 per GB. If the ColumnFamilySplit's locations are EC2 hostnames instead (ec2-*.compute-1.amazonaws.com), they will be resolved internally by Amazon to the private IP, lowering transfer costs for non-local tasks.

> Hadoop integration does not perform reverse DNS lookup correctly on EC2
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7431
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7431
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>
> The split assignment on AbstractColumnFamilyInputFormat:247 performs a reverse DNS lookup of Cassandra IPs in order to preserve locality in Hadoop (task trackers are identified by hostnames).
> However, the reverse lookup of an EC2 IP does not yield the EC2 hostname of that endpoint when running from an EC2 instance, due to the use of InetAddress.getHostName().
> In order to show this, consider the following piece of code:
> {code:title=DnsResolver.java|borderStyle=solid}
> import java.net.InetAddress;
>
> public class DnsResolver {
>     public static void main(String[] args) throws Exception {
>         // args[0] is the IP address to look up
>         InetAddress namenodePublicAddress = InetAddress.getByName(args[0]);
>         System.out.println("getHostAddress: " + namenodePublicAddress.getHostAddress());
>         // getHostName() triggers the reverse lookup
>         System.out.println("getHostName: " + namenodePublicAddress.getHostName());
>     }
> }
> {code}
> When this code is run from my machine to perform a reverse lookup of an EC2 IP, the output is:
> {code:none}
> ➜ java DnsResolver 54.201.254.99
> getHostAddress: 54.201.254.99
> getHostName: ec2-54-201-254-99.compute-1.amazonaws.com
> {code}
> When this code is executed from inside an EC2 machine, the output is:
> {code:none}
> ➜ java DnsResolver 54.201.254.99
> getHostAddress: 54.201.254.99
> getHostName: 54.201.254.99
> {code}
> However, when using Linux tools such as "host" or "dig", the EC2 hostname is properly resolved from the EC2 instance, so there's some problem with Java's InetAddress.getHostName() and EC2.
> Two consequences of this bug during AbstractColumnFamilyInputFormat split definition are:
> 1) If the Hadoop cluster is configured to use EC2 public DNS, locality will be lost, because Hadoop will try to match the CFIF split location (public IP) with the task tracker location (public DNS), so no matches will be found.
> 2) If the Cassandra nodes' broadcast_address is set to public IPs, all Hadoop communication will go through the public IP, which will incur additional transfer charges. If the public IP is mapped to the EC2 DNS name during split definition, then when the task is executed, ColumnFamilyRecordReader will resolve the public DNS name to the private IP of the instance, so there will be no additional charges.
> A similar bug was filed in the WHIRR project:
> https://issues.apache.org/jira/browse/WHIRR-128
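Since "host" and "dig" do return the EC2 name from inside EC2, one possible workaround is to query the PTR record directly instead of going through InetAddress.getHostName(). The following is only a sketch of that idea, not the patch for this issue: the class and method names are hypothetical, and it relies on the JDK's JNDI DNS provider (com.sun.jndi.dns.DnsContextFactory). A likely cause of the difference (an assumption, not confirmed in the report) is that getHostName() verifies the reverse answer with a forward lookup, which inside EC2 returns the private IP and so fails the check.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Sketch only (hypothetical names): reads the PTR record directly rather than
// calling InetAddress.getHostName().
class PtrLookupSketch {
    // Builds the reverse-lookup name for a dotted IPv4 address:
    // the octets are reversed and ".in-addr.arpa" is appended.
    static String reverseName(String ipv4) {
        String[] octets = ipv4.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not a dotted IPv4 address: " + ipv4);
        }
        return octets[3] + "." + octets[2] + "." + octets[1] + "." + octets[0] + ".in-addr.arpa";
    }

    // Asks the platform's configured DNS servers for the PTR record.
    // Needs network access; throws NamingException if the lookup fails.
    static String lookupPtr(String ipv4) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(reverseName(ipv4), new String[] {"PTR"});
            Attribute ptr = attrs.get("PTR");
            if (ptr == null) {
                return ipv4; // no PTR record: fall back to the address itself
            }
            String host = ptr.get().toString();
            // DNS answers are fully qualified and end with a dot; strip it.
            return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String ip = args.length > 0 ? args[0] : "54.201.254.99";
        System.out.println("PTR query name: " + reverseName(ip));
        // Only go to the network when an address was passed explicitly.
        if (args.length > 0) {
            System.out.println("PTR answer:     " + lookupPtr(ip));
        }
    }
}
```

Run as `java PtrLookupSketch 54.201.254.99` from inside and outside EC2 to compare the answer with the DnsResolver output quoted above. Whether the returned name matches the task tracker names still depends on how the Hadoop cluster is configured.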