pig-user mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Working with an unknown number of values
Date Tue, 10 May 2011 21:27:21 GMT
TOKENIZE takes a string and returns a bag.  Its limitation right now is
that it only allows you to split on whitespace.  It would make sense to
generalize this to take a delimiter.

Alan.
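
To make the suggested generalization concrete, here is a minimal sketch in plain Python (not Pig) of a tokenizer that takes an optional delimiter. The function name `tokenize` and its `delimiter` parameter are hypothetical, and a bag is modeled here as a list of one-field tuples; this is an illustration of the idea, not the builtin's actual implementation.

```python
def tokenize(text, delimiter=None):
    """Split text into a 'bag', modeled as a list of one-field tuples.

    With delimiter=None the text is split on runs of whitespace;
    otherwise it is split on the given delimiter string.
    Empty tokens (e.g. from a leading delimiter) are dropped.
    """
    if text is None:
        return []
    parts = text.split() if delimiter is None else text.split(delimiter)
    return [(part,) for part in parts if part]


# Hypothetical inputs modeled on the thread's hashtag field.
hashtags = tokenize("#with#hash#tags", "#")
words = tokenize("some message goes here")
```

Because each token becomes its own tuple, the result can be flattened into one row per token, which is what the grouping discussed in the quoted thread needs.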

On May 7, 2011, at 7:55 PM, Jacob Perkins wrote:

> Dmitriy,
>
>   I see your point. It would definitely be nice to have a builtin for
> returning a bag though. I'd actually be happy if
> TOBAG(FLATTEN(STRSPLIT(X,','))) worked.
>
> --jacob
> @thedatachef
>
> On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
>> FWIW -- the reason STRSPLIT returns a Tuple is that the more common
>> case is thought to be splitting a string of a known format and trying
>> to get some part of it.
>>
>> so, "foreach address_book generate STRSPLIT(phone_number, '-') as
>> (area_code, top_3, bottom_4);"
>>
>> RegexExtractAll (whatever it's called these days) should return a bag,
>> iirc.
>>
>> D
>>
>> On Fri, May 6, 2011 at 2:59 PM, jacob <jacob.a.perkins@gmail.com>  
>> wrote:
>>> On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
>>>>>
>>>>>> #1) Let's say you are tracking messages and extracting the hash tags
>>>>>> from the message and storing them as one field (#hash1#hash2#hash3).
>>>>>> This means you might have a line that looks something like the
>>>>>> following:
>>>>>>      2343    2011-05-06T03:04:00.000Z    username
>>>>>> some+message+goes+here#with+#hash+#tags    #with#hash#tags   some other
>>>>>> info
>>>>>>
>>>>>> How can I get the # of tweets per hash tag? Also, how can I get the
>>>>>> # of tweets per user per hash tag?
>>>>>> I know I can use the STRSPLIT function to split on '#'. That will give
>>>>>> me a bag of hash tags. How can I then group by these such that each
>>>>>> hash tag has a set of tweets?
>>>>> You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
>>>>> the hashtag itself.
>>>>>
>>>>
>>>> If each message has an unknown number of hashtags, will a 'FLATTEN'
>>>> give me an unknown # of fields? If so, how do I know which field to
>>>> group by? I don't want to group by messages that have the exact hash
>>>> tags. I want all messages that have one of the hash tags.
>>>
>>> Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
>>> NOT a bag. If you could get a bag then you could do the following (I'm
>>> throwing out some fields for now):
>>>
>>> A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
>>> B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS hashtag;
>>> C = GROUP B BY hashtag;
>>>
>>> Then C will contain a key (the hashtag) and a bag containing all the
>>> tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
>>> to do the same as STRSPLIT but return a bag instead.
>>>
>>> ie.
>>>
>>> #foobar tweet text,#foobar
>>> this tweet has #two #hashtags,#two#hashtags
>>> another #foobar tweet,#foobar
>>>
>>> will yield:
>>>
>>> #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet, #foobar)}
>>> #two,      {(this tweet has #two #hashtags, #two)}
>>> #hashtags, {(this tweet has #two #hashtags, #hashtags)}
>>>
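
The FLATTEN-then-GROUP step quoted above can be emulated in plain Python to check the expected result. This is only a sketch of the semantics, not Pig code: `split_hashtags` is a hypothetical stand-in for 'MySplittingUDF', and the rows are the three sample tweets from the quoted example.

```python
from collections import defaultdict


def split_hashtags(field):
    # Hypothetical stand-in for MySplittingUDF: one '#tag' string per hashtag.
    return ["#" + tag for tag in field.split("#") if tag]


def group_by_hashtag(rows):
    """Emit one (text, hashtag) row per hashtag, then group rows by hashtag."""
    groups = defaultdict(list)
    for text, hashtags in rows:
        for hashtag in split_hashtags(hashtags):      # the FLATTEN step
            groups[hashtag].append((text, hashtag))   # the GROUP BY step
    return dict(groups)


rows = [
    ("#foobar tweet text", "#foobar"),
    ("this tweet has #two #hashtags", "#two#hashtags"),
    ("another #foobar tweet", "#foobar"),
]
grouped = group_by_hashtag(rows)

# Number of tweets per hashtag is the size of each group.
tweets_per_hashtag = {tag: len(bag) for tag, bag in grouped.items()}
```

A tweet with several hashtags appears once in each matching group, which is why the number of hashtags per message does not need to be known in advance.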
>>>
>>>>
>>>>
>>>>>>    But now I want to end up with something like the following:
>>>>>>
>>>>>> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3    1983
>>>>>>
>>>>>> If I knew the directives ahead of time, I know I can do something like
>>>>>> the following:
>>>>>>
>>>>>> D = GROUP C BY date;
>>>>>>
>>>>>> E = FOREACH D {
>>>>>>     DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>>>>>>     DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>>>>>>     DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>>>>>>     GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
>>>>>>         COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
>>>>>> }
>>>>>>
>>>>>> But how do I do this w/o having to hardcode the filters? Am I thinking
>>>>>> about this all wrong?
>>>>>>
>>>>> It's really a matter of how you structure your data ahead of time.
>>>>> Imagine the data looking like this instead (call it X):
>>>>>
>>>>> 201101,directive1
>>>>> 201101,directive1
>>>>> 201101,directive2
>>>>> 201101,directive2
>>>>> 201101,directive2
>>>>> 201101,directive3
>>>>> 201102,directive2
>>>>> 201102,directive4
>>>>> 201103,directive1
>>>>>
>>>>> This is how my data looks (row and column wise)
>>>>
>>>>>
>>>>> then, a simple:
>>>>>
>>>>> Y = GROUP X BY (date,directive);
>>>>> Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive),
>>>>>     COUNT(X) AS num_occurrences;
>>>>>
>>>>> would result in:
>>>>>
>>>>> 201101,directive1,2
>>>>> 201101,directive2,3
>>>>> 201101,directive3,1
>>>>> 201102,directive2,1
>>>>> 201102,directive4,1
>>>>> 201103,directive1,1
>>>>>
>>>>> At least, that's what it _seems_ like you're asking for.
>>>>>
>>>> I've gotten that far. I'm actually asking about being able to put those
>>>> into columns and not rows.
>>>>
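
As an illustration of the rows-versus-columns distinction raised here, the following plain-Python sketch (not Pig) first counts occurrences per (date, directive), as in the quoted GROUP/COUNT example, and then folds each date's counts into a single row without hardcoding the directive names. The function names are hypothetical; the data is the sample relation X from the quoted message.

```python
from collections import Counter, defaultdict


def count_by_date_and_directive(rows):
    # Equivalent in spirit to GROUP X BY (date, directive) followed by COUNT.
    return Counter(rows)


def pivot_counts(counts):
    """Fold (date, directive) counts into one row per date.

    Each row is (date, directive, count, directive, count, ...), with
    directives in sorted order and no directive names hardcoded.
    """
    by_date = defaultdict(list)
    for (date, directive), n in sorted(counts.items()):
        by_date[date].extend([directive, n])
    return [(date, *cols) for date, cols in sorted(by_date.items())]


x = [
    ("201101", "directive1"), ("201101", "directive1"),
    ("201101", "directive2"), ("201101", "directive2"),
    ("201101", "directive2"), ("201101", "directive3"),
    ("201102", "directive2"), ("201102", "directive4"),
    ("201103", "directive1"),
]
counts = count_by_date_and_directive(x)
pivoted = pivot_counts(counts)
```

Note that the pivoted rows have different lengths when dates have different sets of directives, which is one reason the row-per-(date, directive) layout is easier to work with downstream.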
>>>>>
>>>>> --jacob
>>>>> @thedatachef
>>>>>
>>>>> Thanks Jacob!
>>>>
>>>> -Christian
>>>>
>>>>>
>>>>>> Thanks very much for your help,
>>>>>> Christian
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>

