incubator-cassandra-user mailing list archives

From William Oberman <>
Subject Re: cassandra/hadoop/pig
Date Wed, 06 Jul 2011 19:52:24 GMT
That makes sense.  The problem is I jumped directly to using pig, which is
abstracting some of the data flow from me.  I guess I'll have to figure out
what it's doing under the covers, to know how to optimize/fix bottlenecks.

But for now, I'm taking this information to mean "I should run datanodes
with HDFS on larger non-root disks on all tasktracker nodes to ensure my pig
scripts work, until I'm willing to either write the M/R code myself, or
figure out how to optimize pig and/or the pig script".
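
One thing I plan to try for peeking under the covers is pig's EXPLAIN.  A
minimal sketch (the alias and input path are just placeholders):

    grunt> rows = LOAD '/some/input' USING PigStorage();
    grunt> EXPLAIN rows;  -- prints the logical, physical, and map/reduce plans

The map/reduce plan at the end shows how many hadoop jobs the script compiles
into, which seems like a reasonable place to start looking for the phases that
write intermediate data.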


On Wed, Jul 6, 2011 at 3:29 PM, Edward Capriolo <> wrote:

> On Wed, Jul 6, 2011 at 2:48 PM, William Oberman <> wrote:
>> I have a few cassandra/hadoop/pig questions.  I currently have things set
>> up in a test environment, and for the most part everything works.  But,
>> before I start to roll things out to production, I wanted to check
>> on/confirm some things.
>> When I originally set things up, I followed two different guides.
>> One difference I noticed between the two guides, which I ignored at the
>> time, was how "datanodes" are treated.  The wiki said "At least one node in
>> your cluster will also need to be a datanode. That's because Hadoop uses
>> HDFS to store information like jar dependencies for your job, static data
>> (like stop words for a word count), and things like that - it's the
>> distributed cache. It's a very small amount of data but the Hadoop cluster
>> needs it to run properly".  But the hadoop guide (if you follow it blindly,
>> like I did) creates a datanode on all TaskTracker nodes.  I _think_ that is
>> controlled by the conf/slaves file, but I haven't proved that yet.  Is there
>> any good reason to run datanodes on only the JobTracker vs. on all nodes?
>> If I should only run it on the JobTracker, how do I properly stop the
>> datanodes from starting automatically (when both start-dfs and start-mapred
>> seem to draw from the same slaves file)?
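>> (To make that question concrete, here is what I have in mind, as a minimal
>> sketch: it assumes the stock 0.20-style start scripts, the dfs-slaves file
>> name is made up, and the HADOOP_SLAVES behaviour should be double-checked
>> against bin/slaves.sh in your version.
>>
>>     # conf/slaves      : all TaskTracker nodes, used by start-mapred.sh
>>     # conf/dfs-slaves  : only the JobTracker box, for the single datanode
>>     HADOOP_SLAVES=conf/dfs-slaves bin/start-dfs.sh
>>     bin/start-mapred.sh
>>
>> Or perhaps a second --config directory with its own slaves file would do the
>> same thing.)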
>> I noticed a second issue/oddness with datanodes, in that the HDFS data
>> isn't always small.  The other day I ran out of disk running my pig script.
>> I checked, and by default hadoop creates HDFS under /tmp; I'm using EC2,
>> where /tmp sits on the boot device, which is only 10G by default.  Do other
>> people put HDFS on a different disk?  If so, I'd really want to run only one
>> datanode, since I'd rather add one new JobTracker node with an HDFS disk
>> than re-template all of my cassandra nodes to have HDFS disks.
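>> (The relocation I'm considering looks roughly like this; a minimal sketch
>> assuming 0.20-style property names and that the instance-store volume is
>> mounted at /mnt, with the paths as placeholders:
>>
>>     <!-- core-site.xml: move hadoop's scratch/root dir off the 10G boot disk -->
>>     <property><name>hadoop.tmp.dir</name><value>/mnt/hadoop</value></property>
>>
>>     <!-- hdfs-site.xml: or point the datanode block storage there explicitly -->
>>     <property><name>dfs.data.dir</name><value>/mnt/hadoop/dfs/data</value></property>
>>
>> As far as I can tell, dfs.data.dir and mapred.local.dir both default to
>> directories under hadoop.tmp.dir, which is why everything lands in /tmp out
>> of the box.)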
>> In terms of hardware, I am running small instances (32bit, 2GB) in the
>> test cluster, while my production cluster uses larges (64bit, 7 or 8GB).  I
>> was going to check the performance impact there, but even on smalls in test
>> I was able to run hadoop jobs while serving web requests.  I am wondering if
>> smalls are causing the high HDFS usage though (I think data might "spill"
>> more, if I'm understanding things correctly).
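>> (If spill is the culprit, I gather the relevant knob is the map-side sort
>> buffer; a minimal sketch, assuming 0.20-era mapred-site.xml names and a
>> made-up value:
>>
>>     <!-- bigger sort buffer = fewer spills, at the cost of per-task heap -->
>>     <property><name>io.sort.mb</name><value>200</value></property>
>>
>> Though I also gather map-side spills go to mapred.local.dir on local disk
>> rather than into HDFS, so they may not explain the HDFS growth.)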
>> If these are more hadoop than cassandra questions, let me know and I'll
>> move my questions around.
>> I did want to mention that these are small details compared to the number
>> of complicated things that worked like a charm during my configuration and
>> testing of the combination of cassandra/hadoop/pig.  It was impressive :-)
>> Thanks!
>> will
> The logic that "only one datanode is needed" is not an absolute truth.  If
> your jobs read with ColumnFamilyInputFormat and write with
> ColumnFamilyOutputFormat, then technically you only need one DataNode to hold
> the distributed cache.  However, if you have a large volume of intermediate
> results, or a multiphase job that has to persist data between phases (this is
> very, very common), then that single DataNode is a bottleneck.  Most
> hadoop clusters run a DataNode and a TaskTracker on each slave.
> Most situations use datanodes very heavily.  For example, suppose you have 4
> map/reduce jobs to run on the same Cassandra data.  Ingesting the data
> from Cassandra at the beginning of each job would be wasteful.  It
> might be better to take the data into HDFS during the first job and then
> save it. Your subsequent jobs could use that instead of re-acquiring it from
> Cassandra.
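> Roughly, in pig terms (a minimal sketch; the keyspace/column family and the
> HDFS path are placeholders, and it assumes the CassandraStorage loader from
> cassandra's contrib/pig with its jars registered):
>
>     -- job 1: read out of Cassandra once and park a copy in HDFS
>     raw = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
>           USING org.apache.cassandra.hadoop.pig.CassandraStorage();
>     STORE raw INTO '/data/cassandra_snapshot' USING BinStorage();
>
>     -- jobs 2..N: work from the HDFS copy instead of re-reading Cassandra
>     rows = LOAD '/data/cassandra_snapshot' USING BinStorage();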
