pig-dev mailing list archives

From "Artem Ervits (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3119) Aggregation not working in conjunction with REGEX_EXTRACT_ALL
Date Fri, 30 Sep 2016 20:50:20 GMT

    [ https://issues.apache.org/jira/browse/PIG-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537030#comment-15537030 ]

Artem Ervits commented on PIG-3119:
-----------------------------------


I tried it with the script and log file provided; B does contain empty tuples.
```
grunt> A = LOAD 'starwar_log1.txt' USING TextLoader AS (line:chararray);
grunt> B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(S+) (S+) (S+) \\[(\\w:/+\\s[+\\-]d{4})] "(.?)" (S) (S+) "([^"])" "([^"])" "([^"]*)" (S+) ') ) AS
>> (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray, request: chararray, status: int, bytes_string: chararray, referrer: chararray, Mozilla: chararray,wookie_cookie: chararray,browser3: chararray,acess_status:int);
grunt> dump B;
```
```
()
()
()
```
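
One way to narrow down why B comes out empty is to try the pattern outside Pig. The sketch below uses Python's `re` module as a stand-in for the Java regex engine, and assumes that REGEX_EXTRACT_ALL only yields groups when the pattern matches the whole line (modelled here with `re.fullmatch`). The sample line is the one from the report with the cookie abbreviated, and the pattern is the reporter's with Pig's doubled backslashes written singly. The `extended` pattern is hypothetical and is not a confirmed fix for PIG-3119. Note also that the pattern in the grunt session above has no backslashes before `S` and `d`, unlike the reporter's script; it is unclear whether that is what was actually run or an artifact of the archive.

```python
import re

# Sample log line from the report; the cookie value is abbreviated here.
line = (
    '192.168.90.36 - - [16/May/2012:16:00:11 -0700] '
    '"GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" '
    '200 22620 '
    '"http://www.starwars.com/explore/encyclopedia/characters/2/featured/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" '
    '"Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; JSESSIONID=aHX_NQheRq08" '
    '"-" 0'
)

# The reporter's pattern (11 capture groups, ending in a literal space).
reported = (
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)" "([^"]*)" (\S+) '
)

# Whole-line match (assumed to mirror what the UDF requires).
full = re.fullmatch(reported, line)

# Prefix-only match, to see how far the pattern gets.
prefix = re.match(reported, line)

# Hypothetical variant: add a 12th group for the final field.
extended = re.fullmatch(reported + r'(\S+)', line)

print("whole-line match:", full is not None)
print("groups in prefix match:", len(prefix.groups()))
print("unmatched remainder:", repr(line[prefix.end():]))
print("groups in extended match:", len(extended.groups()))
```

Under these assumptions the reported pattern covers only a prefix of the line: it captures 11 groups and leaves the final `0` unmatched, while the AS clause names 12 fields. A whole-line match therefore fails for this sample, which would be consistent with empty tuples in B.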




> Aggregation not working in conjunction with REGEX_EXTRACT_ALL
> -------------------------------------------------------------
>
>                 Key: PIG-3119
>                 URL: https://issues.apache.org/jira/browse/PIG-3119
>             Project: Pig
>          Issue Type: Bug
>          Components: build, grunt
>    Affects Versions: 0.9.1
>         Environment: OS -version
> ================================
> Linux version 2.6.18-194.3.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48))
> software installed
> =======================
> hadoop-1.0.4
> pig-0.9.1
> Hardware details
> ====================================
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 26
> model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
> stepping        : 4
> cpu MHz         : 2800.098
> cache size      : 8192 KB
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
>            Reporter: siddhartha Pattanaik
>            Priority: Critical
>              Labels: newbie
>             Fix For: 0.9.1
>
>         Attachments: starwar_log1.txt
>
>   Original Estimate: 276h
>  Remaining Estimate: 276h
>
> Hi,
> I have a use case in my project requirements.
> The input file consists of lines in the following pattern:
> 192.168.90.36 - - [16/May/2012:16:00:11 -0700] "GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 "http://www.starwars.com/explore/encyclopedia/characters/2/featured/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" "Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; __utma=181359608.119611689.1337206567.1337206567.1337206567.1; __utmb=181359608.79.9.1337209104786; __utmc=181359608; __utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=aHX_NQheRq08" "-" 0
> I want to run an aggregate function along with REGEX_EXTRACT_ALL to extract the desired data.
> Although the input file parses, the aggregate function does not work on the result.
> Please find the Pig script below:
> ***************Ip_adress-count************************
> Ip_adress_count.pig
>  
> A = LOAD 'starwar_log1' USING TextLoader AS (line:chararray);
> B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "([^"]*)" (\\S+) ') ) AS
> (
> remoteAddr: chararray, 
> remoteLogname: chararray, 
> user: chararray,  
> time: chararray, 
> request: chararray, 
> status: int, 
> bytes_string: chararray, 
> referrer: chararray, 
> Mozilla: chararray,
> wookie_cookie: chararray,
> browser3: chararray,
> acess_status:int
> );
> C = group B by remoteAddr;
> D = foreach C generate COUNT(B) as ip_adress_count;
> E = order D by ip_adress_count;
> F = STORE E INTO 'ip_adress_count/' using PigStorage(',');
> Expected O/p
> ===========================
> ip_adress_count
> remoteAddr,ip_adress_count
> 192.168.90.36,19
> 192.168.90.37,1
> There is no parsing issue, but the aggregate function COUNT() does not work on the output of REGEX_EXTRACT_ALL.
> Please do the needful. The requirement is that I need the count of each IP address in the input data.
> thanks,
> siddharth
> contact -8763666372



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
