Subject: Re: Working with an unknown number of values
From: Xiaomeng Wan
To: user@pig.apache.org
Date: Fri, 6 May 2011 15:29:30 -0600

You can group on group, like this:

A = LOAD '/some/dir' USING PigStorage() AS (date, directive);
B = GROUP A BY (date, directive);
C = FOREACH B GENERATE FLATTEN(group) AS (date, directive), COUNT(A) AS cnt;
D = GROUP C BY date;
E = FOREACH D GENERATE group AS date, C.(directive, cnt) AS cnts;

Shawn

On Fri, May 6, 2011 at 3:14 PM, Christian wrote:
> I am sorry if this has been asked in the past. I can't seem to find
> information on it.
>
> I have two questions, but they are somewhat related.
>
> #1) Let's say you are tracking messages and extracting the hash tags from
> the message and storing them as one field (#hash1#hash2#hash3). This means
> you might have a line that looks something like the following:
>      2343    2011-05-06T03:04:00.000Z    username    some+message+goes+here#with+#hash+#tags    #with#hash#tags    some    other    info
>
> How can I get the # of tweets per hash tag? Also, how can I get the # of
> tweets per user per hash tag?
> I know I can use the STRSPLIT function to split on '#'. That will give me a
> bag of hash tags. How can I then group by these such that each hash tag has
> a set of tweets?
>
>
> #2) Let's say you have a field that has a fairly small, but still unknown
> number of unique values (say between 20-5).
> I know I can group by these
> fields to get a count by doing something like so:
>
> A = LOAD '/some/dir' Using PigStorage (date, directive);
>
> B = GROUP A by (date, directive);
>
> C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);
>
> But now I want to end up with something like the following:
>
> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3    1983
>
> If I knew the directives ahead of time, I know I can do something like the
> following:
>
> D = GROUP C BY date;
>
> E = FOREACH D {
>     DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>     DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>     DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>     GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
>         COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> }
>
> But how do I do this w/o having to hardcode the filters? Am I thinking about
> this all wrong?
>
> Thanks very much for your help,
> Christian
>
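
For anyone who wants to check what the group-on-group answer to question #2 should produce, the same two steps can be modeled in plain Python. This is only a sketch of the data flow, not Pig code: the function name counts_per_date and the sample rows are made up for illustration, and the sorting is added just to make the output deterministic.

```python
from collections import Counter, defaultdict


def counts_per_date(rows):
    """Model of the two-step grouping on (date, directive) pairs.

    rows is an iterable of (date, directive) tuples, standing in for
    the loaded relation. Returns {date: [(directive, count), ...]}.
    """
    # First grouping: count how often each (date, directive) pair occurs.
    pair_counts = Counter(rows)

    # Second grouping: regroup the counted pairs by date alone, so each
    # date carries however many directives actually appeared for it.
    per_date = defaultdict(list)
    for (date, directive), cnt in sorted(pair_counts.items()):
        per_date[date].append((directive, cnt))
    return dict(per_date)


# Hypothetical sample input, not taken from the thread.
rows = [
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE2"),
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-02", "DIRECTIVE3"),
]
result = counts_per_date(rows)
print(result)
```

No directive name is hardcoded anywhere: each date ends up with one (directive, count) entry per directive that occurred on it, which is the nested-bag shape rather than the fixed-column row shown in the question.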