Subject: Re: cassandra/hadoop/pig
From: Edward Capriolo
To: user@cassandra.apache.org
Date: Wed, 6 Jul 2011 15:29:35 -0400

On Wed, Jul 6, 2011 at 2:48 PM, William Oberman <oberman@civicscience.com> wrote:

> I have a few cassandra/hadoop/pig questions. I currently have things set
> up in a test environment, and for the most part everything works. But,
> before I start to roll things out to production, I wanted to check
> on/confirm some things.
>
> When I originally set things up, I used:
> http://wiki.apache.org/cassandra/HadoopSupport
> http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html
>
> One difference I noticed between the two guides, which I ignored at the
> time, was how "datanodes" are treated. The wiki said "At least one node in
> your cluster will also need to be a datanode. That's because Hadoop uses
> HDFS to store information like jar dependencies for your job, static data
> (like stop words for a word count), and things like that - it's the
> distributed cache. It's a very small amount of data but the Hadoop cluster
> needs it to run properly". But, the hadoop guide (if you follow it blindly
> like I did), creates a datanode on all TaskTracker nodes. I _think_ that is
> controlled by the conf/slaves file, but I haven't proved that yet.
> Is there any good reason to run datanodes on only the JobTracker vs. on all
> nodes? If I should only run it on the JobTracker, how do I properly stop the
> datanodes from starting automatically (when both start-dfs and start-mapred
> seem to draw from the same slaves file)?
>
> I noticed a second issue/oddness with datanodes, in that the HDFS data
> isn't always small. The other day I ran out of disk running my pig script.
> I checked, and by default, hadoop creates HDFS in /tmp, and I'm using EC2
> (and /tmp is on the boot device), which is only 10G by default. Do other
> people put HDFS on a different disk? If yes, I'll really want to only run
> one datanode, as I don't want to re-template all of my cassandra nodes to
> have HDFS disks vs. one new JobTracker node.
>
> In terms of hardware, I am running small instances (32bit, 2GB) in the test
> cluster, while my production cluster is larges (64bit, 7 or 8GB). I was
> going to check the performance impact there, but even on smalls in test I
> was able to run hadoop jobs while serving web requests. I am wondering if
> smalls are causing the high HDFS usage though (I think data might "spill"
> more, if I'm understanding things correctly).
>
> If these are more hadoop than cassandra questions, let me know and I'll
> move my questions around.
>
> I did want to mention that these are small details compared to the amount
> of complicated things that worked like a charm during my configuration and
> testing of the combination of cassandra/hadoop/pig. It was impressive :-)
>
> Thanks!
>
> will
>

The advice that "only one datanode is needed" is not an absolute truth. If your jobs use ColumnFamilyInputFormat to read and ColumnFamilyOutputFormat to write, then technically you only need one DataNode to hold the distributed cache. However, if you have a large amount of intermediate results, or a multiphase job that has to persist data between phases (this is very common), then that single DataNode becomes a bottleneck. Most Hadoop clusters run a DataNode and a TaskTracker on each slave.

Many workloads use the datanodes heavily. For example, suppose you have 4 map/reduce jobs to run on the same Cassandra data. Ingesting the data from Cassandra at the beginning of each job would be wasteful. It might be better to pull the data into HDFS during the first job and save it there; your subsequent jobs could then read that copy instead of re-acquiring it from Cassandra.
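To make that last point concrete, here is a rough Pig sketch of the "ingest once, reuse from HDFS" pattern (untested; the keyspace, column family, and paths are invented, CassandraStorage is the loader shipped in Cassandra's contrib/pig, and the cassandra:// URI form assumed here is the 0.7/0.8-era one):

    -- First job: read the column family out of Cassandra once and park a copy in HDFS.
    rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage();
    STORE rows INTO '/data/mycf_snapshot' USING BinStorage();

    -- Later jobs: read the HDFS copy instead of pulling from Cassandra again.
    cached = LOAD '/data/mycf_snapshot' USING BinStorage();
    -- ... per-job grouping/filtering/aggregation on 'cached' goes here ...

BinStorage keeps the Pig types intact between jobs, which matters because CassandraStorage hands back rows as (key, bag of (name, value)) tuples.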
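On keeping DataNodes from starting on every TaskTracker host: start-dfs.sh and start-mapred.sh do both iterate over conf/slaves, but the HDFS side can be pointed at a different host list, or the single DataNode can simply be started by hand. A rough, untested sketch against a 0.20.x install (conf/dfs.slaves is a made-up file name used only for illustration):

    # conf/slaves      - all TaskTracker hosts (used by start-mapred.sh as usual)
    # conf/dfs.slaves  - hypothetical shorter list: only the host(s) that should run a DataNode

    # MapReduce daemons on every slave:
    bin/start-mapred.sh

    # HDFS daemons, with the slave-iterating scripts reading the shorter list
    # (slaves.sh, which hadoop-daemons.sh calls, falls back to conf/slaves only
    # when HADOOP_SLAVES is unset):
    HADOOP_SLAVES=conf/dfs.slaves bin/start-dfs.sh

    # Alternatively, skip the cluster start script for HDFS slaves entirely and
    # start the one DataNode manually on the box that should host it:
    bin/hadoop-daemon.sh start datanode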
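On the /tmp problem: out of the box hadoop.tmp.dir is /tmp/hadoop-${user.name}, and the DFS block/metadata directories and the MapReduce local ("spill") directory all default to subdirectories of it, which is why the small EC2 root volume fills up. A sketch of the properties to move them onto a dedicated mount (/mnt/hdfs is just an example path; the property names are the 0.20.x ones):

    <!-- core-site.xml: move the general scratch area off the root volume -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/mnt/hdfs/tmp</value>
    </property>

    <!-- hdfs-site.xml: explicit namenode metadata and datanode block directories -->
    <property>
      <name>dfs.name.dir</name>
      <value>/mnt/hdfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/hdfs/data</value>
    </property>

    <!-- mapred-site.xml: where map output spill files land (local disk, not HDFS) -->
    <property>
      <name>mapred.local.dir</name>
      <value>/mnt/hdfs/mapred-local</value>
    </property>

That last one also touches the "spill" question: intermediate map output goes to mapred.local.dir on the local filesystem rather than into HDFS, but since it defaults to a path under hadoop.tmp.dir it ends up competing for the same 10G volume.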
