pig-dev mailing list archives

From "Niels Basjes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4639) Add better parser for Apache HTTPD access log.
Date Wed, 05 Aug 2015 15:16:04 GMT

    [ https://issues.apache.org/jira/browse/PIG-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658348#comment-14658348 ]

Niels Basjes commented on PIG-4639:
-----------------------------------

A simple first test you can run after building Pig with this patch is to run this Pig script
locally:

{code}
REGISTER ./contrib/piggybank/java/piggybank.jar

Example =
    LOAD 'test.pig'
    USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader('combined');
DUMP Example;
{code}

The output is an example of how to define a working parser that gives you the fields
you want.
It is actual working Pig code that will parse an Apache httpd access log file in the
given format into all the fields requested.
In this example the output is a single tuple containing a single string that looks like this:
{code}
(


Clicks =
    LOAD 'access.log'
    USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader(
        'combined',

        'IP:connection.client.host',
        'NUMBER:connection.client.logname',
        'STRING:connection.client.user',
        'TIME.STAMP:request.receive.time',
        'TIME.DAY:request.receive.time.day',
        'TIME.MONTHNAME:request.receive.time.monthname',
        'TIME.MONTH:request.receive.time.month',
        'TIME.WEEK:request.receive.time.weekofweekyear',
        'TIME.YEAR:request.receive.time.weekyear',
        'TIME.YEAR:request.receive.time.year',
        'TIME.HOUR:request.receive.time.hour',
        'TIME.MINUTE:request.receive.time.minute',
        'TIME.SECOND:request.receive.time.second',
        'TIME.MILLISECOND:request.receive.time.millisecond',
        'TIME.ZONE:request.receive.time.timezone',
        'TIME.EPOCH:request.receive.time.epoch',
        'TIME.DAY:request.receive.time.day_utc',
        'TIME.MONTHNAME:request.receive.time.monthname_utc',
        'TIME.MONTH:request.receive.time.month_utc',
        'TIME.WEEK:request.receive.time.weekofweekyear_utc',
        'TIME.YEAR:request.receive.time.weekyear_utc',
        'TIME.YEAR:request.receive.time.year_utc',
        'TIME.HOUR:request.receive.time.hour_utc',
        'TIME.MINUTE:request.receive.time.minute_utc',
        'TIME.SECOND:request.receive.time.second_utc',
        'TIME.MILLISECOND:request.receive.time.millisecond_utc',
        'HTTP.FIRSTLINE:request.firstline',
        'HTTP.METHOD:request.firstline.method',
        'HTTP.URI:request.firstline.uri',
        'HTTP.PROTOCOL:request.firstline.uri.protocol',
        'HTTP.USERINFO:request.firstline.uri.userinfo',
        'HTTP.HOST:request.firstline.uri.host',
        'HTTP.PORT:request.firstline.uri.port',
        'HTTP.PATH:request.firstline.uri.path',
        'HTTP.QUERYSTRING:request.firstline.uri.query',
        'STRING:request.firstline.uri.query.*',  -- If you only want a single field replace * with name and change type to chararray',
        'HTTP.REF:request.firstline.uri.ref',
        'HTTP.PROTOCOL:request.firstline.protocol',
        'HTTP.PROTOCOL.VERSION:request.firstline.protocol.version',
        'STRING:request.status.last',
        'BYTES:response.body.bytesclf',
        'HTTP.URI:request.referer',
        'HTTP.PROTOCOL:request.referer.protocol',
        'HTTP.USERINFO:request.referer.userinfo',
        'HTTP.HOST:request.referer.host',
        'HTTP.PORT:request.referer.port',
        'HTTP.PATH:request.referer.path',
        'HTTP.QUERYSTRING:request.referer.query',
        'STRING:request.referer.query.*',  -- If you only want a single field replace * with name and change type to chararray',
        'HTTP.REF:request.referer.ref',
        'HTTP.USERAGENT:request.user-agent')
    AS (
        connection_client_host:chararray,
        connection_client_logname:long,
        connection_client_user:chararray,
        request_receive_time:chararray,
        request_receive_time_day:long,
        request_receive_time_monthname:chararray,
        request_receive_time_month:long,
        request_receive_time_weekofweekyear:long,
        request_receive_time_weekyear:long,
        request_receive_time_year:long,
        request_receive_time_hour:long,
        request_receive_time_minute:long,
        request_receive_time_second:long,
        request_receive_time_millisecond:long,
        request_receive_time_timezone:chararray,
        request_receive_time_epoch:long,
        request_receive_time_day_utc:long,
        request_receive_time_monthname_utc:chararray,
        request_receive_time_month_utc:long,
        request_receive_time_weekofweekyear_utc:long,
        request_receive_time_weekyear_utc:long,
        request_receive_time_year_utc:long,
        request_receive_time_hour_utc:long,
        request_receive_time_minute_utc:long,
        request_receive_time_second_utc:long,
        request_receive_time_millisecond_utc:long,
        request_firstline:chararray,
        request_firstline_method:chararray,
        request_firstline_uri:chararray,
        request_firstline_uri_protocol:chararray,
        request_firstline_uri_userinfo:chararray,
        request_firstline_uri_host:chararray,
        request_firstline_uri_port:long,
        request_firstline_uri_path:chararray,
        request_firstline_uri_query:chararray,
        request_firstline_uri_query__:map[],  -- If you only want a single field replace * with name and change type to chararray,
        request_firstline_uri_ref:chararray,
        request_firstline_protocol:chararray,
        request_firstline_protocol_version:chararray,
        request_status_last:chararray,
        response_body_bytesclf:long,
        request_referer:chararray,
        request_referer_protocol:chararray,
        request_referer_userinfo:chararray,
        request_referer_host:chararray,
        request_referer_port:long,
        request_referer_path:chararray,
        request_referer_query:chararray,
        request_referer_query__:map[],  -- If you only want a single field replace * with name and change type to chararray,
        request_referer_ref:chararray,
        request_user_agent:chararray);



)

{code}
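
As a rough sketch of how the generated code might be trimmed down, assume you only need the client host, the epoch timestamp, and one query string parameter. The parameter name 'q' is purely hypothetical; the other field names are copied from the generated output above, with * replaced by the parameter name and the type changed to chararray as the generated comment suggests. Treat this as an illustration rather than verified output:

{code}
REGISTER ./contrib/piggybank/java/piggybank.jar

-- Request only the fields you need; each requested field
-- gets a matching entry, in the same order, in the AS clause.
Clicks =
    LOAD 'access.log'
    USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader(
        'combined',
        'IP:connection.client.host',
        'TIME.EPOCH:request.receive.time.epoch',
        'STRING:request.firstline.uri.query.q')  -- 'q' is a hypothetical parameter name
    AS (
        connection_client_host:chararray,
        request_receive_time_epoch:long,
        request_firstline_uri_query_q:chararray);

DUMP Clicks;
{code}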


> Add better parser for Apache HTTPD access log.
> ----------------------------------------------
>
>                 Key: PIG-4639
>                 URL: https://issues.apache.org/jira/browse/PIG-4639
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>    Affects Versions: 0.15.0
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>             Fix For: 0.16.0
>
>         Attachments: PIG-4639-20150723-classnotfound.patch, PIG-4639-20150725.patch, PIG-4639-20150805-1247.patch
>
>
> Currently there are two parsers for Apache HTTPD access log files in piggybank. They only allow parsing the 'combined' and 'common' log formats, and they only parse the 'basics'.
> This is a proposed patch to add the existing https://github.com/nielsbasjes/logparser (Apache 2.0 license) as an 'out of the box' parser to piggybank.
> This parser parses the log file using the LogFormat specification used to write it. Almost all LogFormat specifiers are supported, so it adds easy parsing capabilities for (almost) all custom log formats used in production scenarios.
> This parser also goes much deeper in the sense that it allows extracting things like the value of a cookie or the value of a query string parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
