hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arun Patel <arunp.bigd...@gmail.com>
Subject Re: RegexSerDe with Filters
Date Fri, 01 Jul 2016 18:19:38 GMT
Dudu,

Thanks for your continued support.  I need one more quick help.  I have one
more log file as shown below.

STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO]
TID:101114719017567668 cluster1 ProcessingHandler Message timestamp:
1467216639090

STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID:
101114719017567668 cluster1: Processed request

STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID:
101114719017567668 Request received in writer

STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID:
101114719017567668 Processed request

STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID:
101114719017567668 Total time: 10 ms


I need to create 3 views for 3 requirements.

1) create a view to get timestamp, TID number and cluster1 for lines
"ProcessingHandler
Message timestamp".  But, for this line there is no space between TID: and
TID number.


2) create a view to get timestamp, TID for the lines "Request received in
writer".  There is a space between TID: and TID number.


3) Create a view to get timestamp, TID for the lines "Total time:".  There
is a space between TID: and TID number.


How do I create base table and views?  I am planning to join these 3 views
based on TID.  Do I need to take any special considerations?


Regards,

Venkat





On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel <arunp.bigdata@gmail.com> wrote:

> Dudu, Thanks for the clarification. Looks like I have an issue with my
> Hive installation.  I tried in a different cluster and it works.
>
> Thanks again.
>
>
> On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dmarkovitz@paypal.com>
> wrote:
>
>> This is a tested, working code.
>>
>> If you’re using https://regex101.com ,first replace backslash pairs (\\
>> ) with a single backslash (\) and also use the ‘g’ modifier in order to
>> find all of the matches.
>>
>>
>>
>> The regular expression is -
>>
>> (\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2}
>> [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)
>>
>>
>>
>> I’ll send you a screen shot in private, since you don’t want to expose
>> the data.
>>
>>
>>
>> Dudu
>>
>>
>>
>>
>>
>> *From:* Arun Patel [mailto:arunp.bigdata@gmail.com]
>> *Sent:* Friday, June 24, 2016 9:33 PM
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: RegexSerDe with Filters
>>
>>
>>
>> Looks like Regex pattern is not working.  I tested the pattern on
>> https://regex101.com/ and it does not find any match.
>>
>>
>>
>> Any suggestions?
>>
>>
>>
>> On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dmarkovitz@paypal.com>
>> wrote:
>>
>> My pleasure.
>>
>> Please feel free to reach me if needed.
>>
>>
>>
>> Dudu
>>
>>
>>
>> *From:* Arun Patel [mailto:arunp.bigdata@gmail.com]
>> *Sent:* Wednesday, June 22, 2016 2:57 AM
>> *To:* user@hive.apache.org
>> *Subject:* Re: RegexSerDe with Filters
>>
>>
>>
>> Thank you very much, Dudu.  This really helps.
>>
>>
>>
>> On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkovitz@paypal.com>
>> wrote:
>>
>> Hi
>>
>>
>>
>> Here is the code (without the log data).
>>
>>
>>
>> I’ve created some of the views using different text processing technics.
>>
>> The rest of the views could be create in similar ways.
>>
>>
>>
>>
>>
>> Dudu
>>
>>
>>
>>
>>
>>
>>
>> bash
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> hdfs dfs -mkdir -p /tmp/log/20160621
>>
>> hdfs dfs –put logfile.txt /tmp/log/20160621
>>
>>
>>
>>
>>
>> hive
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> /*
>>
>>     External table log
>>
>>
>>
>>     Defines all common columns + optional column 'tid' which appears in
>> most log records + the rest of the log ('txt')
>>
>>
>>
>> */
>>
>>
>>
>> drop table if exists log;
>>
>>
>>
>> create external table log
>>
>> (
>>
>>     c1              string
>>
>>    ,ts              string
>>
>>    ,ts_frac         string
>>
>>    ,log_rec_level   string
>>
>>    ,c4              string
>>
>>    ,c5              string
>>
>>    ,tid             string
>>
>>    ,txt             string
>>
>> )
>>
>>     partitioned by (dt date)
>>
>>     row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>>
>>     with serdeproperties ('input.regex'='(
>> \\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2}
>> [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})
>> \\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
>>
>>     stored as textfile
>>
>>     location '/tmp/log'
>>
>> ;
>>
>>
>>
>> alter table log add partition (dt=date '2016-06-21') location
>> '/tmp/log/20160621';
>>
>>
>>
>> select * from log;
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> /*
>>
>>     View log_v
>>
>>
>>
>>     Base view for all other views
>>
>>
>>
>> */
>>
>>
>>
>> drop view if exists log_v;
>>
>>
>>
>> create view log_v
>>
>> as
>>
>> select      c1
>>
>>            ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
>>
>>            ,log_rec_level
>>
>>            ,c4
>>
>>            ,c5
>>
>>            ,cast (ltrim(substr (tid,5)) as bigint)          as tid
>>
>>            ,txt
>>
>>
>>
>> from        log
>>
>> ;
>>
>>
>>
>> select * from log_v;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_reaping_path;
>>
>>
>>
>> create view log_v_reaping_path
>>
>> as
>>
>> select      c1
>>
>>            ,ts
>>
>>            ,log_rec_level
>>
>>            ,c4
>>
>>            ,c5
>>
>>            ,substr (txt,15)     as reaping_path
>>
>>
>>
>> from        log_V
>>
>>
>>
>> where       txt like 'Reaping path: %'
>>
>> ;
>>
>>
>>
>> select * from log_v_reaping_path;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_published_to_kafka;
>>
>>
>>
>> create view log_v_published_to_kafka
>>
>> as
>>
>> select      c1
>>
>>            ,ts
>>
>>            ,log_rec_level
>>
>>            ,c4
>>
>>            ,c5
>>
>>            ,tid
>>
>>
>>
>>            ,      ltrim (kv [' Key']      )                 as key
>>
>>            ,cast (ltrim (kv [' size']     ) as bigint )     as size
>>
>>            ,      ltrim (kv [' topic']    )                 as topic
>>
>>            ,cast (ltrim (kv [' partition']) as int    )     as partition
>>
>>            ,cast (ltrim (kv [' offset']   ) as bigint )     as offset
>>
>>
>>
>> from       (select      c1
>>
>>                        ,ts
>>
>>                        ,log_rec_level
>>
>>                        ,c4
>>
>>                        ,c5
>>
>>                        ,tid
>>
>>                        ,str_to_map (substr (txt ,locate
>> ('.',txt)+1),',',':')   as kv
>>
>>
>>
>>             from        log_V
>>
>>
>>
>>             where       txt like 'Published to Kafka. %'
>>
>>             )
>>
>>             as t
>>
>> ;
>>
>>
>>
>> select * from log_v_published_to_kafka;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_get_request;
>>
>>
>>
>> create view log_v_get_request
>>
>> as
>>
>> select      c1
>>
>>            ,ts
>>
>>            ,log_rec_level
>>
>>            ,c4
>>
>>            ,c5
>>
>>            ,tid
>>
>>            ,substr (txt,31) as path
>>
>>
>>
>> from        log_V
>>
>>
>>
>> where       txt like 'GET request received for path %'
>>
>> ;
>>
>>
>>
>> select * from log_v_get_request;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_unlock_request;
>>
>>
>>
>> create view log_v_unlock_request
>>
>> as
>>
>> select      c1
>>
>>            ,ts
>>
>>            ,log_rec_level
>>
>>            ,c4
>>
>>            ,c5
>>
>>            ,tid
>>
>>            ,regexp_extract (txt,'rowkey (\\S+)',1)      as rowkey
>>
>>            ,regexp_extract (txt,'lock id (\\S+)',1)     as lock_id
>>
>>
>>
>> from        log_V
>>
>>
>>
>> where       txt like 'Unlock request for schema DU %'
>>
>> ;
>>
>>
>>
>>
>>
>> *From:* Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
>> *Sent:* Tuesday, June 21, 2016 2:26 PM
>> *To:* user@hive.apache.org
>> *Subject:* RE: RegexSerDe with Filters
>>
>>
>>
>> Hi
>>
>>
>>
>> I would suggest creating a single external table with daily partitions
>> and multiple views each with the appropriate filtering.
>>
>> If you’ll send me log sample (~100 rows) I’ll send you an example.
>>
>>
>>
>> Dudu
>>
>>
>>
>> *From:* Arun Patel [mailto:arunp.bigdata@gmail.com
>> <arunp.bigdata@gmail.com>]
>> *Sent:* Tuesday, June 21, 2016 1:51 AM
>> *To:* user@hive.apache.org
>> *Subject:* RegexSerDe with Filters
>>
>>
>>
>> Hello Hive Experts,
>>
>>
>>
>> I use flume to ingest application specific logs from Syslog to HDFS.
>> Currently, I grep the HDFS directory for specific patterns (for multiple
>> types of requests) and then create reports.  However, generating reports
>> for Weekly and Monthly are not salable.
>>
>>
>>
>> I would like to create multiple external on the daily HDFS directory
>> partitioned by date with RegexSerde and then create separate Parquet tables
>> for every kind of request.
>>
>>
>>
>> Question is - How do I create multiple (about 20) RegexSerde tables on
>> same data applying filters?  This will be just like 20 grep commands I am
>> running today.
>>
>>
>>
>> Example:  hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE
>> Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
>>
>>                 hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN
>> Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
>>
>>                 hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL
>> TIME' | awk '{print $4, $24}'
>>
>>
>>
>> I would like to create a tables which does this kind of job and then
>> writes output to Parquet tables.
>>
>>
>>
>> Please let me know how this can be done.  Thank you!
>>
>>
>>
>> Regards,
>>
>> Arun
>>
>>
>>
>>
>>
>
>

Mime
View raw message