Subject: Re: prep for cassandra storage from pig
From: Jeremy Hanna
Date: Wed, 15 Jun 2011 14:25:24 -0500
To: user@pig.apache.org
Cc: user@cassandra.apache.org

Yeah - for completely dynamic column names, From/ToCassandraBag doesn't handle that. It does handle prefixed names though - like link* will get a bag of all the columns that start with link. But it sounds like you are doing what I would have to do if I got into a nested data conundrum. Like I said, others may have better advice for getting the data the way you want it.

On Jun 15, 2011, at 2:08 PM, William Oberman wrote:

> My problem is the column names are dynamic (a date), and pygmalion seems to want the column names to be fixed at "compile time" (the script).
>
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna wrote:
>
>> Hi Will,
>>
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from pygmalion - it does the work for you to get it back into a form that cassandra understands.
>>
>> Others may know better how to massage the data into that form using just pig, but if all else fails, you could write a udf to do that.
>>
>> Jeremy
>>
>> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>>
>>> I think I'm stuck on typing issues trying to store data in cassandra. To verify, cassandra wants (key, {tuples})
>>>
>>> My pig script is fairly brief:
>>> raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
>>> --columns == timeUUID -> JSON
>>> rows = FOREACH raw GENERATE key, FLATTEN(columns);
>>> alias_target_day = FOREACH rows {
>>>     --I wrote a specialized parser that does exactly what I need
>>>     observation_map = com.civicscience.pig.ParseObservation($2);
>>>     GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day;
>>> };
>>> grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
>>> X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count;
>>>
>>> This gets me:
>>> (targetA, (day1, count))
>>> (targetA, (day2, count))
>>> (targetB, (day1, count))
>>> ....
>>>
>>> But, cassandra wants the 2nd item to be a bag. So, I tried:
>>> X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
>>>
>>> But this results in:
>>> (targetA, {((day1, count))})
>>> (targetA, {((day2, count))})
>>> (targetB, {((day1, count))})
>>> It's hard to see, but the 2nd item now has a nested tuple as the first value, which is still bad.
>>>
>>> How do I get (key, {tuple})??? I wasn't sure where to post this (pig or cassandra), so I'm posting to the pig list too.
>>>
>>> will
>>
>>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com
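
For reference, the target shape from the thread, (key, {(name, value), ...}), can be modelled outside Pig. The following is only a plain-Python sketch of the data shapes, not Pig or pygmalion code: the function name `to_cassandra_rows` and the sample counts are made up for illustration. It takes rows shaped like (target, (day, count)) and collects them into one bag of flat (day, count) pairs per target, which is the un-nested form the original question is after.

```python
from collections import defaultdict


def to_cassandra_rows(rows):
    """Collect (target, (day, count)) rows into (target, [(day, count), ...]).

    Hypothetical helper for illustration only; it models the shapes
    discussed in the thread and is not part of Pig or pygmalion.
    """
    bags = defaultdict(list)
    for target, (day, count) in rows:
        # Append the flat (name, value) pair itself. Wrapping it in another
        # tuple here would reproduce the nested {((day, count))} shape.
        bags[target].append((day, count))
    # One (key, bag) pair per target, ordered by key for stable output.
    return sorted(bags.items())


# Illustrative sample data; the counts are invented.
rows = [
    ("targetA", ("day1", 3)),
    ("targetA", ("day2", 5)),
    ("targetB", ("day1", 2)),
]
result = to_cassandra_rows(rows)
```

In this model each bag holds the (day, count) pairs directly, so every element is a two-field tuple (column name, column value) rather than a tuple containing another tuple.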