incubator-cassandra-user mailing list archives

From Jeremy Hanna <jeremy.hanna1...@gmail.com>
Subject Re: hadoop/pig notes
Date Wed, 08 Jun 2011 22:39:05 GMT
I need to update the wiki with better pig info.  I did put some information in the getting
started docs of pygmalion, but it would be good to transfer that to cassandra's wiki and add
to it.
fwiw - https://github.com/jeromatron/pygmalion/wiki/Getting-Started

Thanks for the rundown William!


On Jun 8, 2011, at 4:11 PM, William Oberman wrote:

> I decided to try out hadoop/pig + cassandra.  I had my ups and downs to get the script
> I wanted to run to work.  I'm sure everyone who tries will have their own experiences/problems,
> but mine were:
> 
> -Everything I needed to know was in http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
> and http://wiki.apache.org/cassandra/HadoopSupport
> 
> -Java is really picky about hostnames.  I'm in EC2, and rather than rely on DNS, I basically
> have all of my machines share an /etc/hosts file.  But, the command line "hostname" wasn't
> returning the same thing as in /etc/hosts, which caused all kinds of weird hadoop issues at
> first.  (I had hostname as "foo" and /etc/hosts had "foo.prod").
> 
> -I forgot I had iptables on.  It's always easier to not have firewalls to start (this
> is true when configuring anything of course)
> 
> -Use the same version of everything everywhere.  And for hadoop/pig, I was having issues
> until I used the combination of hadoop-0.20.2 + pig-0.8.1.
> 
> -For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port), and there isn't
> a standard, and it seems arbitrary.  I used 8021, based on notes in a case somewhere from
> hadoop (I think trying to standardize).
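
The hostname mismatch described in the list above ("foo" from the hostname command versus "foo.prod" in /etc/hosts) can be caught before starting Hadoop. The following is a minimal sketch, not something from the original message: the helper names hosts_file_names and hostname_is_listed are hypothetical, and it assumes the standard /etc/hosts layout of an IP address followed by one or more names.

```python
def hosts_file_names(hosts_text):
    """Collect every hostname and alias from /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def hostname_is_listed(hostname, hosts_text):
    """True only if the exact hostname string appears in the hosts text."""
    return hostname in hosts_file_names(hosts_text)
```

On each node, one could pass socket.gethostname() and the contents of /etc/hosts to hostname_is_listed and treat a False result as a sign that the two disagree.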
> 
> It took me a while to figure out the syntax of Pig Latin, but I finally managed to get
> a script that does a count of all columns in a column family:
> rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage();
> filter_rows = FILTER rows BY $1 is not null;
> counts = FOREACH filter_rows GENERATE COUNT($1);
> counts_in_bag = GROUP counts ALL; 
> sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1); 
> dump sum_of_bag;
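
For readers new to Pig Latin, the data flow of the script above can be sketched in plain Python. This is only an illustration under stated assumptions: each loaded row is modeled as a (key, columns) pair in which columns is a list of column tuples or None, and count_all_columns is a hypothetical name. It does not use Pig or CassandraStorage.

```python
def count_all_columns(rows):
    """Mimic the script: per-row column counts, then one grand total."""
    # FILTER rows BY $1 is not null
    filtered = [(key, cols) for key, cols in rows if cols is not None]
    # FOREACH filter_rows GENERATE COUNT($1): one count per row
    counts = [len(cols) for _key, cols in filtered]
    # GROUP counts ALL, then SUM over the single resulting bag
    return sum(counts)
```

The two-stage shape (count per row, then sum over one group) mirrors the script: COUNT works on the bag of columns inside each row, so a second aggregation is needed to get a single total.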
> 
> I'm trying to see the impact of running hadoop on the same servers as cassandra now.
> And yes, I've seen the note in the wiki about the clever partitioning of cassandra nodes
> to allow for "web latency" nodes + "hadoop processing" nodes :-)
> 

