incubator-cassandra-user mailing list archives

From William Oberman <ober...@civicscience.com>
Subject hadoop/pig notes
Date Wed, 08 Jun 2011 21:11:21 GMT
I decided to try out hadoop/pig + cassandra.  I had my ups and downs getting
the script I wanted to work.  I'm sure everyone who tries will have
their own experiences/problems, but mine were:

-Everything I needed to know was in
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html and
http://wiki.apache.org/cassandra/HadoopSupport

-Java is really picky about hostnames.  I'm in EC2, and rather than rely on
DNS, all of my machines share an /etc/hosts file.  But the "hostname"
command wasn't returning the same name as the one in /etc/hosts, which
caused all kinds of weird hadoop issues at first.  (I had hostname as
"foo" and /etc/hosts had "foo.prod").

-I forgot I had iptables on.  It's always easier to start without firewalls
(this is true when configuring anything, of course).

-Use the same version of everything everywhere.  For hadoop/pig, I had
issues until I used the combination of hadoop-0.20.2 + pig-0.8.1.

-For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port).
There isn't a standard port, so the choice seems arbitrary.  I used 8021,
based on notes in a case somewhere from hadoop (I think they were trying to
standardize).
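
The hostname and port notes above can be checked mechanically before
starting any daemons.  Below is a minimal sketch, not something from the
hadoop docs: the function names and sample data are made up for
illustration, and it assumes the job tracker address lives in the
mapred.job.tracker property of mapred-site.xml (the property name used by
hadoop 0.20.x).

```python
import xml.etree.ElementTree as ET


def hosts_file_names(hosts_text):
    """Return every hostname and alias listed in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) >= 2:
            names.update(fields[1:])  # fields[0] is the IP address
    return names


def job_tracker_address(mapred_site_xml):
    """Return the mapred.job.tracker value from mapred-site.xml text, or None."""
    root = ET.fromstring(mapred_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "mapred.job.tracker":
            return prop.findtext("value")
    return None


def check_node(hostname, hosts_text, mapred_site_xml):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    names = hosts_file_names(hosts_text)
    if hostname not in names:
        problems.append(f"hostname {hostname!r} is not in the hosts file")
    address = job_tracker_address(mapred_site_xml)
    if address is None:
        problems.append("mapred.job.tracker is not set")
    else:
        host, sep, port = address.partition(":")
        if not sep or not port.isdigit():
            problems.append(f"mapred.job.tracker {address!r} is not hostname:port")
        elif host not in names:
            problems.append(f"job tracker host {host!r} is not in the hosts file")
    return problems


# Made-up sample data mirroring the "foo" vs "foo.prod" mismatch.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
10.0.0.1    foo.prod
10.0.0.2    bar.prod   # second node
"""

SAMPLE_MAPRED_SITE = """\
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>foo.prod:8021</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(check_node("foo", SAMPLE_HOSTS, SAMPLE_MAPRED_SITE))
    print(check_node("foo.prod", SAMPLE_HOSTS, SAMPLE_MAPRED_SITE))
```

On a real node you would pass socket.gethostname() and the contents of
/etc/hosts and conf/mapred-site.xml instead of the sample strings.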

It took me a while to figure out the syntax of Pig Latin, but I finally
managed to get a script that does a count of all columns in a column family:
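-- Each row is (key, bag of columns); $1 is the column bag.
rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage();
filter_rows = FILTER rows BY $1 is not null;
-- Count the columns in each row.
counts = FOREACH filter_rows GENERATE COUNT($1);
-- Collect the per-row counts into one bag and add them up.
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;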
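
For anyone else puzzling over the Pig Latin, here is a rough Python
analogue of what each statement in the script above computes.  It is only a
sketch on made-up toy data: it assumes each loaded row has the shape (key,
bag of (name, value) columns), and the row keys, column names, and the
count_all_columns function are hypothetical.

```python
# Toy rows shaped like the loaded relation: (row key, bag of (name, value) columns).
# The keys and columns are made up for illustration.
rows = [
    ("row1", [("a", "1"), ("b", "2"), ("c", "3")]),
    ("row2", [("a", "4")]),
    ("row3", None),  # a row whose column bag is null
    ("row4", [("x", "5"), ("y", "6")]),
]


def count_all_columns(rows):
    """Mirror the Pig script step by step and return the total column count."""
    # filter_rows = FILTER rows BY $1 is not null;
    filter_rows = [row for row in rows if row[1] is not None]
    # counts = FOREACH filter_rows GENERATE COUNT($1);
    counts = [len(row[1]) for row in filter_rows]
    # counts_in_bag = GROUP counts ALL;  (one group holding every per-row count)
    counts_in_bag = ("all", counts)
    # sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
    return sum(counts_in_bag[1])


if __name__ == "__main__":
    print(count_all_columns(rows))  # 3 + 1 + 2 = 6
```

The two-step shape (count per row, then group everything and sum) is the
point: COUNT works on the bag inside each row, so a second pass is needed to
get one total.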

I'm now trying to see the impact of running hadoop on the same servers as
cassandra.  And yes, I've seen the note in the wiki about the clever
partitioning of cassandra nodes to allow for "web latency" nodes + "hadoop
processing" nodes :-)
