From: William Oberman
Date: Wed, 8 Jun 2011 17:11:21 -0400
Subject: hadoop/pig notes
To: user@cassandra.apache.org

I decided to try out hadoop/pig + cassandra. I had my ups and downs getting the script I wanted to run to work. I'm sure everyone who tries will have their own experiences/problems, but mine were:

-Everything I needed to know was in http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html and http://wiki.apache.org/cassandra/HadoopSupport

-Java is really picky about hostnames. I'm in EC2, and rather than rely on DNS, I basically have all of my machines share an /etc/hosts file. But the "hostname" command wasn't returning the same name as the one in /etc/hosts, which caused all kinds of weird hadoop issues at first. (I had hostname as "foo" and /etc/hosts had "foo.prod").

-I forgot I had iptables on. It's always easier to start without firewalls (this is true when configuring anything, of course).

-Use the same version of everything everywhere. For hadoop/pig, I was having issues until I used the combination of hadoop-0.20.2 + pig-0.8.1.

-For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port), and there isn't a standard one, so the choice seems arbitrary. I used 8021, based on notes in a case somewhere from hadoop (I think trying to standardize).

It took me a while to figure out the syntax of Pig Latin, but I finally managed to get a script that does a count of all columns in a column family:

rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage();
filter_rows = FILTER rows BY $1 is not null;
counts = FOREACH filter_rows GENERATE COUNT($1);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;

I'm trying to see the impact of running hadoop on the same servers as cassandra now.
And yes, I've seen the note in the wiki about the clever partitioning of cassandra nodes to allow for "web latency" nodes + "hadoop processing" nodes :-)
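
The hostname mismatch described above ("foo" from the hostname command vs. "foo.prod" in /etc/hosts) is the kind of problem a small pre-flight check could catch before starting hadoop. What follows is only a sketch, not something from the original setup: the function names (names_in_hosts, hostname_is_listed, check_local_host) are made up for illustration, and it assumes the usual /etc/hosts layout of one IP address followed by one or more names per line.

```python
import socket


def names_in_hosts(hosts_text):
    """Collect every hostname and alias listed in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        # First field is the IP address; the remaining fields are names.
        names.update(entry.split()[1:])
    return names


def hostname_is_listed(hostname, hosts_text):
    """True only if the exact hostname appears among the listed names."""
    return hostname in names_in_hosts(hosts_text)


def check_local_host(hosts_path="/etc/hosts"):
    """Compare this machine's hostname against the shared hosts file.

    hosts_path is an assumption; point it at wherever the shared file lives.
    """
    with open(hosts_path) as f:
        return hostname_is_listed(socket.gethostname(), f.read())
```

Running check_local_host() on each node before bringing up the cluster would return False on a machine whose hostname is "foo" while the shared file only lists "foo.prod". The comparison is deliberately an exact string match, since that is the level at which the mismatch caused trouble.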