airflow-commits mailing list archives

From "wei.he (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-595) PigOperator
Date Tue, 25 Oct 2016 12:01:04 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

wei.he updated AIRFLOW-595:
---------------------------
    Description: 
When I use the PigOperator, I run into two issues.
h3. 1. How should I add "-param" options so that the Pig script runs dynamically? For example,
I would like it to run like *pig -Dmapreduce.job.name=test -Dmapreduce.job.queuename=mapreduce
-param bid_input=/tmp/input/* -param output=/tmp/output -f test.pig*

{code:title=test.pig|borderStyle=solid}
run -param log_name=raw_log_bid -param src_folder='${input}'  load.pig;
A = FOREACH raw_log_bid GENERATE ActionId AS ActionId;
A = DISTINCT A;
C = GROUP A ALL;
D = foreach C GENERATE COUNT(A);
STORE D INTO '$output';
{code}

{code:title=task.py|borderStyle=solid}
...
task_pigjob = PigOperator(
    task_id='task_pigjob',
    pigparams_jinja_translate=True,    
    pig='test.pig',
    #pig=templated_command,
    dag=dag)
....
{code}
How can I set the "param"?
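One possible workaround, until PigOperator supports params directly, is to build the full pig command line in Python and hand it to a BashOperator instead. The helper below is only a sketch; `build_pig_command` is a made-up name, not an Airflow or Pig API:

```python
# Hypothetical helper: build the pig command line by hand so the
# -D properties and -param key=value pairs land in the right places.
def build_pig_command(script, properties=None, params=None):
    """Return a pig CLI invocation: -D properties first,
    then -param pairs, then the script via -f."""
    parts = ["pig"]
    for key, value in (properties or {}).items():
        parts.append("-D%s=%s" % (key, value))
    for key, value in (params or {}).items():
        parts.extend(["-param", "%s=%s" % (key, value)])
    parts.extend(["-f", script])
    return " ".join(parts)

cmd = build_pig_command(
    "test.pig",
    properties={"mapreduce.job.name": "test"},
    params={"bid_input": "/tmp/input/*", "output": "/tmp/output"},
)
# cmd can then be passed as bash_command to a BashOperator
# in place of the PigOperator task above.
```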
h3. 2. After I add the conn_id "pig_cli_default" for PigHook and set its Extra to {"pig_properties":
"-Dpig.tmpfilecompression=true"}, I ran a test DAG and got the following log:
{quote}
INFO - pig -f /tmp/airflow_pigop_c83K9T/tmpqw5on_ -Dpig.tmpfilecompression=true 
[2016-10-25 17:32:44,619] {ipy_pig_hook.py:105} INFO - any environment variables that are
set by the pig command.
[2016-10-25 17:32:44,901] {models.py:1286} ERROR -
Apache Pig version 0.12.1.2.1.4.0-632 (rexported)
compiled Jul 29 2014, 18:24:35
USAGE: Pig [options] [-] : Run interactively in grunt shell.
       Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
       Pig [options] [-f[ile]] file : Run cmds found in file.
  options include:
    -4, -log4jconf - Log4j configuration file, overrides log conf
    -b, -brief - Brief logging (no timestamps)
    -c, -check - Syntax check
    -d, -debug - Debug level, INFO is default
    -e, -execute - Commands to execute (within quotes)
    -f, -file - Path to the script to execute
    -g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
    -h, -help - Display this message. You can specify topic to get help for that topic.
        properties is the only topic currently supported: -h properties.
    -i, -version - Display version information
    -l, -logfile - Path to client side log file; default is current working directory.
    -m, -param_file - Path to the parameter file
    -p, -param - Key value pair of the form param=val
    -r, -dryrun - Produces script with substituted parameters. Script is not executed.
    -t, -optimizer_off - Turn optimizations off. The following values are supported:
            SplitFilter - Split filter conditions
            PushUpFilter - Filter as early as possible
            MergeFilter - Merge filter conditions
            PushDownForeachFlatten - Join or explode as late as possible
            LimitOptimizer - Limit as early as possible
            ColumnMapKeyPrune - Remove unused data
            AddForEach - Add ForEach to remove unneeded columns
            MergeForEach - Merge adjacent ForEach
            GroupByConstParallelSetter - Force parallel 1 for "group all" statement
            All - Disable all optimizations
        All optimizations listed here are enabled by default. Optimization values are case
insensitive.
    -v, -verbose - Print all error messages to screen
    -w, -warning - Turn warning logging on; also turns warning aggregation off
    -x, -exectype - Set execution mode: local|mapreduce, default is mapreduce.
    -F, -stop_on_failure - Aborts execution on the first failed job; default is off
    -M, -no_multiquery - Turn multiquery optimization off; default is on
    -P, -propertyFile - Path to property file
    -printCmdDebug - Overrides anything else and prints the actual command used to run Pig,
including
                     any environment variables that are set by the pig command.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/airflow/models.py", line 1245, in run
    result = task_copy.execute(context=context)
  File "/usr/lib/python2.7/site-packages/airflow/operators/ipy_pig_operator.py", line 56,
in execute
    self.hook.run_cli(pig=self.pig)
  File "/usr/lib/python2.7/site-packages/airflow/hooks/ipy_pig_hook.py", line 109, in run_cli
    raise AirflowException(stdout)
AirflowException:
Apache Pig version 0.12.1.2.1.4.0-632 (rexported)
compiled Jul 29 2014, 18:24:35
{quote}

The reason is that the command is run as *pig -f test.pig -Dpig.tmpfilecompression=true*, with the properties appended after -f, which Pig rejects with the usage message above.
Is this fixed in the latest version?
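Judging from the failing command in the log, the fix is a matter of argument order: the pig_properties from the connection Extra need to be emitted before -f. A minimal sketch (the helper name is hypothetical, not the actual hook code):

```python
# Hypothetical sketch of the ordering fix; assemble_pig_cli is not an
# Airflow function, it only illustrates where the properties must go.
def assemble_pig_cli(pig_properties, script_path):
    # The failing command in the log was:  pig -f <script> <properties>
    # Pig only accepts the -D properties *before* the -f flag,
    # so emit them first:                  pig <properties> -f <script>
    return "pig %s -f %s" % (pig_properties, script_path)

print(assemble_pig_cli("-Dpig.tmpfilecompression=true", "test.pig"))
```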

!image.jpg|thumbnail!



> PigOperator
> -----------
>
>                 Key: AIRFLOW-595
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-595
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: operators
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: wei.he
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
