hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages
Date Thu, 29 Nov 2007 17:40:43 GMT

    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811
] 

ciemo edited comment on PIG-12 at 11/29/07 9:40 AM:
---------------------------------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external
events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of
my goals was to have 3 running times (time for current task/job, overall running time, overall
time of queue time + running time).  I really didn't want to have write a parsing script.
 In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it
makes the job harder because I have to convert the timestamps into time values and then do
the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more
valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to
get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time
stamps.

If there were one scenario where I might turn off timestamps by default, it would be when
people are using the Pig grunt shell interactively.  But as soon as we are in batch mode,
add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution
environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

{noformat}
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string
), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT
$2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
{noformat}

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{code:perl}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{code}

      was (Author: ciemo):
    Two comments -

Timestamps are really useful when you need to correlate events in the execution with external
events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of
my goals was to have 3 running times (time for current task/job, overall running time, overall
time of queue time + running time).  I really didn't want to have write a parsing script.
 In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it
makes the job harder because I have to convert the timestamps into time values and then do
the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more
valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to
get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time
stamps.

If there were one scenario where I might turn off timestamps by default, it would be when
people are using the Pig grunt shell interactively.  But as soon as we are in batch mode,
add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution
environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

<pre>
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string
), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT
$2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
</pre>

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{noformat}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{noformat}
  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after
the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution
time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message