hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Date Thu, 08 Jul 2010 22:45:51 GMT

     [ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Richard Ding updated PIG-1483:
------------------------------

    Attachment:     (was: PIG-1483.patch)

> [piggybank] Add HadoopJobHistoryLoader to the piggybank
> -------------------------------------------------------
>
>                 Key: PIG-1483
>                 URL: https://issues.apache.org/jira/browse/PIG-1483
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>             Fix For: 0.8.0
>
>         Attachments: PIG-1483.patch
>
>
> PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible
to use Pig for querying Hadoop job history/xml files to get script-level usage statistics.
What we need is a Pig loader that can parse these files and generate corresponding data objects.
> The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
> Here is an example that shows the intended usage:
> *Find all the jobs grouped by script and user:*
> {code}
> a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[],
m:map[], r:map[]);
> b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user,
(Chararray) j#'JOBID' as job; 
> c = filter b by not (id is null);
> d = group c by (id, user);
> e = foreach d generate flatten(group), c.job;
> dump e;
> {code}
> A couple more examples:
> *Find scripts that use only the default parallelism:*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[],
r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name,
(Long) r#'NUMBER_REDUCES' as reduces;
> c = group b by (id, user, script_name) parallel 10;
> d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
> e = filter d by max_reduces == 1;
> dump e;
> {code}
> *Find the running time of each script (in seconds):*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[],
r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name,
(Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
> c = group b by (id, user, script_name)
> d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start)/1000;
> dump d;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message