Subject: Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS
From: Jeremy Hanna
Date: Tue, 30 Aug 2011 13:42:11 -0500
To: user@cassandra.apache.org
Cc: user@pig.apache.org, user@hive.apache.org

I've tried to help out with some UDFs and references that help with our use case: https://github.com/jeromatron/pygmalion/

There are also some Brisk docs on Pig that might be helpful: http://www.datastax.com/docs/0.8/brisk/about_pig
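For a concrete starting point, here is a minimal sketch of loading a column family into Pig with CassandraStorage and pulling named columns out with a pygmalion-style UDF. The jar paths, keyspace and column family names, and the exact UDF package are from memory and may differ across versions, so treat it as illustrative only:

-- Illustrative sketch: jar paths, names and the UDF package are assumptions.
-- CassandraStorage picks up its connection settings from the environment
-- (PIG_INITIAL_ADDRESS, PIG_RPC_PORT, PIG_PARTITIONER in the 0.8-era contrib).
register /path/to/apache-cassandra.jar;
register /path/to/pygmalion.jar;

define FromCassandraBag org.pygmalion.udf.FromCassandraBag();

-- CassandraStorage presents each row as (key, columns:{(name, value)}).
rows = LOAD 'cassandra://MyKeyspace/Users'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Flatten the columns of interest into plain fields.
users = FOREACH rows GENERATE
            key,
            FLATTEN(FromCassandraBag('first_name,last_name', columns))
                AS (first_name:chararray, last_name:chararray);

DUMP users;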
On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:

> Thanks Jeremy for your response. That gives me some encouragement that I might be on the right track.
>
> I think I need to try out more stuff before coming to a conclusion on Brisk.
>
> For Pig operations over Cassandra, I could only find http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any other resources that you can point me to? There seems to be a lack of samples on this subject.
>
> On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna wrote:
> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to potentially move to Brisk because of the simplicity of operations there.
>
> Not sure what you mean about the true power of Hadoop. In my mind the true power of Hadoop is the ability to parallelize jobs and send each task to where the data resides. HDFS exists to enable that. Brisk is just another HDFS-compatible implementation. If you're already storing your data in Cassandra and are looking to use Hadoop with it, then I would seriously consider using Brisk.
>
> That said, Cassandra with Hadoop works fine.
>
> On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
>
> > Hi Eric,
> >
> > Thanks for your response.
> >
> > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa wrote:
> >
> >> Hi Tharindu, try having a look at Brisk (http://www.datastax.com/products/brisk); it integrates Hadoop with Cassandra and is shipped with Hive for SQL analysis. You can then install Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order to enable data import/export between Hadoop and MySQL.
> >> Does this sound ok to you?
> >>
> > These do sound ok. But I was looking at using something from Apache itself.
> >
> > Brisk sounds nice, but I feel that disregarding HDFS and totally switching to Cassandra is not the right thing to do. Just my opinion there. I feel we are not using the true power of Hadoop then.
> >
> > I feel Pig has more integration with Cassandra, so I might take a look there.
> >
> > Whichever I choose, I will contribute the code back to the Apache projects I use. Here's a sample data analysis I do with my language. Maybe there is no generic way to do what I want to do.
> >
> >> 2011/8/29 Tharindu Mathew
> >>
> >>> Hi,
> >>>
> >>> I have an already running system where I define a simple data flow (using a simple custom data flow language) and configure jobs to run against stored data. I use Quartz to schedule and run these jobs, and the data exists on various data stores (mainly Cassandra, but some data exists in RDBMS like MySQL as well).
> >>>
> >>> Thinking about scalability and the existing support for standard data flow languages in the form of Pig and HiveQL, I plan to move my system to Hadoop.
> >>>
> >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've been reading up and am still contemplating how to make this change.
> >>>
> >>> It would be great to hear the recommended approach for doing this on Hadoop with the integration of Cassandra and other RDBMS. For example, a sample task that already runs on the system is "once every hour, get rows from column family X, aggregate data in columns A, B and C, write back to column family Y, and enter details of the last aggregated row into a table in MySQL".
> >>>
> >>> Thanks in advance.
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> Tharindu
> >>>
> >>
> >>
> >> --
> >> Eric Djatsa Yota
> >> Double degree MSc student in Computer Science Engineering and Communication Networks
> >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)
> >> Intern at AMADEUS S.A.S Sophia Antipolis
> >> djatsaedy@gmail.com
> >> Tel : 0601791859
> >>
> >
> >
> > --
> > Regards,
> >
> > Tharindu
>
>
> --
> Regards,
>
> Tharindu
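To make the hourly example in the original message a bit more concrete, here is a rough Pig sketch of "get rows from column family X, aggregate columns A, B and C, write back to column family Y, and record the last aggregated row in MySQL", which could be kicked off from Quartz or cron each hour. The keyspace, column and table names, the (key, {(name, value)}) store schema for CassandraStorage, and the piggybank DBStorage constructor arguments are all assumptions rather than tested code:

-- Illustrative sketch only: names are made up and the DBStorage arguments
-- should be checked against your piggybank version.
register /path/to/apache-cassandra.jar;
register /path/to/pygmalion.jar;
register /path/to/piggybank.jar;
register /path/to/mysql-connector-java.jar;

define FromCassandraBag org.pygmalion.udf.FromCassandraBag();

rows = LOAD 'cassandra://MyKeyspace/X'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Pull columns A, B and C out of each row as numeric fields.
vals = FOREACH rows GENERATE
           key,
           FLATTEN(FromCassandraBag('A,B,C', columns)) AS (a:long, b:long, c:long);

-- Aggregate across all rows for this hourly run.
grouped = GROUP vals ALL;
agg = FOREACH grouped GENERATE
          'hourly_summary' AS rowkey,
          SUM(vals.a) AS a_sum, SUM(vals.b) AS b_sum, SUM(vals.c) AS c_sum;

-- Write the summary back to column family Y; on store, CassandraStorage
-- expects tuples of (row key, bag of (column name, value)).
to_y = FOREACH agg GENERATE
           rowkey,
           TOBAG(TOTUPLE('a_sum', a_sum), TOTUPLE('b_sum', b_sum), TOTUPLE('c_sum', c_sum));
STORE to_y INTO 'cassandra://MyKeyspace/Y'
      USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Record the last aggregated row in MySQL via piggybank's DBStorage
-- (assumed arguments: driver, JDBC URL, user, password, insert statement).
STORE agg INTO 'ignored' USING org.apache.pig.piggybank.storage.DBStorage(
      'com.mysql.jdbc.Driver', 'jdbc:mysql://dbhost/analytics', 'dbuser', 'dbpass',
      'INSERT INTO last_aggregated (row_key, a_sum, b_sum, c_sum) VALUES (?, ?, ?, ?)');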