Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Sun, 22 Jun 2014 19:33:24 +0000 (UTC)
From: "Paulo Motta (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12722941.1403324711330.25510.1403465604246@arcas>
In-Reply-To: <JIRA.12722941.1403324711330@arcas>
References: <JIRA.12722941.1403324711330@arcas>
Subject: [jira] [Commented] (CASSANDRA-7431) Hadoop integration does not
 perform reverse DNS lookup correctly on EC2
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040220#comment-14040220 ]

Paulo Motta commented on CASSANDRA-7431:
----------------------------------------

I guess Hadoop's InputSplit.getLocations() method (implemented by ColumnFamilySplit) expects a list of hostnames to be able to schedule local tasks, since task trackers are identified by hostnames, not IPs.
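
A minimal sketch of why the form of the location string matters, assuming the scheduler decides locality by comparing each split location against the task tracker's hostname as plain strings. The class and method names here (SplitLocalityCheck, isLocal) are hypothetical and are not Hadoop's actual scheduler code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of locality matching; not Hadoop's real scheduler.
final class SplitLocalityCheck {

    /** Returns true if any split location equals the tracker's hostname (case-insensitive). */
    static boolean isLocal(List<String> splitLocations, String trackerHost) {
        for (String location : splitLocations) {
            if (location.equalsIgnoreCase(trackerHost)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String tracker = "ec2-54-201-254-99.compute-1.amazonaws.com";
        // Split location reported as an IP: never matches a tracker known by hostname.
        System.out.println(isLocal(Arrays.asList("54.201.254.99"), tracker));
        // Split location reported as the same hostname: matches.
        System.out.println(isLocal(Arrays.asList(tracker), tracker));
    }
}
```

Under that assumption, a split whose location is an IP string is never considered local to a tracker registered by hostname, even when both refer to the same machine.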

Using only private IPs in Hadoop is not feasible because you may want to access the task tracker web interfaces from outside EC2, so it's handy to use EC2 public DNS names (ec2-*.compute-1.amazonaws.com) to identify Hadoop task trackers, since these names resolve to private IPs inside EC2 and to public IPs outside it.

Another issue arises when the C* cluster uses public IPs as broadcast_address (such as with the EC2MultiRegionSnitch): Hadoop tasks will access the ColumnFamilySplits of non-local tasks via the public IP, which costs $0.01 per GB. If the ColumnFamilySplit's locations are EC2 hostnames instead (ec2-*.compute-1.amazonaws.com), they will be resolved inside EC2 to the private IP, lowering transfer costs for non-local tasks.

> Hadoop integration does not perform reverse DNS lookup correctly on EC2
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7431
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7431
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>
> The split assignment on AbstractColumnFamilyInputFormat:247 performs a reverse DNS lookup of Cassandra IPs in order to preserve locality in Hadoop (task trackers are identified by hostnames).
> However, the reverse lookup of an EC2 IP does not yield the EC2 hostname of that endpoint when running from an EC2 instance, due to the use of InetAddress.getHostName().
> To demonstrate this, consider the following piece of code:
> {code:title=DnsResolver.java|borderStyle=solid}
> import java.net.InetAddress;
>
> public class DnsResolver {
>     public static void main(String[] args) throws Exception {
>         // args[0] is the IP address to look up
>         InetAddress namenodePublicAddress = InetAddress.getByName(args[0]);
>         System.out.println("getHostAddress: " + namenodePublicAddress.getHostAddress());
>         // getHostName() triggers the reverse DNS lookup
>         System.out.println("getHostName: " + namenodePublicAddress.getHostName());
>     }
> }
> {code}
> When this code is run from my machine to perform a reverse lookup of an EC2 IP, the output is:
> {code:none}
> ➜  java DnsResolver 54.201.254.99
> getHostAddress: 54.201.254.99
> getHostName: ec2-54-201-254-99.compute-1.amazonaws.com
> {code}
> When this code is executed from inside an EC2 machine, the output is:
> {code:none}
> ➜  java DnsResolver 54.201.254.99
> getHostAddress: 54.201.254.99
> getHostName: 54.201.254.99
> {code}
> However, when using Linux tools such as "host" or "dig", the EC2 hostname is properly resolved from the EC2 instance, so there's some problem with Java's InetAddress.getHostName() and EC2.
> Two consequences of this bug during AbstractColumnFamilyInputFormat split definition are:
> 1) If the Hadoop cluster is configured to use EC2 public DNS, locality will be lost, because Hadoop will try to match the CFIF split location (public IP) with the task tracker location (public DNS), so no matches will be found.
> 2) If the Cassandra nodes' broadcast_address is set to public IPs, all Hadoop communication will be done via the public IP, which will incur additional transfer charges. If the public IP is mapped to the EC2 DNS name during split definition, then when the task is executed, ColumnFamilyRecordReader will resolve the public DNS name to the private IP of the instance, so there will be no additional charges.
> A similar bug was filed in the WHIRR project:
> https://issues.apache.org/jira/browse/WHIRR-128
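
A plausible explanation for the difference between "host"/"dig" and Java: InetAddress.getHostName() confirms the reverse-lookup result with a forward lookup and falls back to the textual IP when the two disagree, and inside EC2 the public DNS name resolves to the private IP. One possible workaround, sketched below, is to query the PTR record directly through the JDK's JNDI DNS provider, which skips that confirmation step. This is an illustration under those assumptions, not necessarily the fix adopted for this issue; the class and method names (ReverseDnsLookup, ptrName, reverseLookup) are hypothetical.

```java
import java.util.Hashtable;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper: reverse DNS via a direct PTR query (IPv4 only).
final class ReverseDnsLookup {

    /** Builds the PTR query name for a dotted-quad IPv4 address. */
    static String ptrName(String ipv4) {
        String[] p = ipv4.trim().split("\\.");
        if (p.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        return p[3] + "." + p[2] + "." + p[1] + "." + p[0] + ".in-addr.arpa";
    }

    /** Asks the system's default DNS servers for the PTR record; returns the IP itself on failure. */
    static String reverseLookup(String ipv4) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = null;
        try {
            ctx = new InitialDirContext(env);
            Attributes attrs = ctx.getAttributes(ptrName(ipv4), new String[] {"PTR"});
            Attribute ptr = attrs.get("PTR");
            if (ptr == null || ptr.size() == 0) {
                return ipv4;
            }
            String host = ptr.get(0).toString();
            // PTR values are fully qualified and usually end with a trailing dot.
            return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        } catch (NamingException e) {
            return ipv4; // no PTR record or DNS unreachable
        } finally {
            if (ctx != null) {
                try { ctx.close(); } catch (NamingException ignored) { }
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java ReverseDnsLookup <ipv4-address>");
            return;
        }
        System.out.println(ptrName(args[0]) + " -> " + reverseLookup(args[0]));
    }
}
```

Whether this returns the EC2 hostname from inside an EC2 instance depends on the resolver configured there, so it would need to be checked on an actual instance.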

