hive-user mailing list archives

From: Irfan Mohammed <irfan...@gmail.com>
Subject: Re: cannot start the transform script. reason : "argument list too long"
Date: Tue, 01 Mar 2011 21:39:03 GMT
Thanks, Ajo.

The Hive SQL and the Python script run fine with a dataset of fewer than
10M records. It is only when running against a large dataset of more than
100M records that I run into this error. I have been trying to pin down
either the record count (somewhere between 10M and 100M) at which it starts
failing or the invalid data that is causing it.

I am also trying to get at the actual error raised during the UNIXProcess
instantiation.
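
One thing I learned while digging: error=7 is E2BIG from execve(), and the
kernel counts the environment together with the argument list, so a bloated
environment can fail the exec even when the visible command line is tiny. I
believe Hive's ScriptOperator copies the job configuration into the child
environment, which may be why this only bites on the big runs. A rough
sketch that reproduces the error on Linux (the variable name and the 4 MB
size are arbitrary):

import os
import subprocess

# Inflate the environment well past the kernel's exec limit. The command
# line itself stays tiny, yet the launch still fails with errno 7
# (E2BIG, "Argument list too long").
env = dict(os.environ)
env['HUGE_VAR'] = 'x' * (4 * 1024 * 1024)

try:
    subprocess.check_call(['/usr/bin/python2.6', '-c', 'pass'], env=env)
except OSError, e:
    print 'errno=%d: %s' % (e.errno, e.strerror)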

On Tue, Mar 1, 2011 at 1:23 PM, Ajo Fod <ajo.fod@gmail.com> wrote:

> Instead of
>
> using 'python2.6 user_id_output.py hbase'
>
> try something like this:
>
> using 'user_id_output.py'
>
> ... with a #! line at the top of the script pointing at the python
> binary. I think you can still include a parameter in the call, like:
>
> using 'user_id_output.py hbase'
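>
> For example, the top of user_id_output.py could look like this (just a
> sketch -- point the #! at wherever your python2.6 lives, and make the
> script executable with chmod +x):
>
> #!/usr/bin/env python2.6
> # user_id_output.py: Hive TRANSFORM script; reads tab-separated rows on
> # stdin and writes transformed rows to stdout.
> import sys
>
> # Optional mode flag, e.g. 'hbase', passed on the 'using' line.
> mode = sys.argv[1] if len(sys.argv) > 1 else 'hbase'
>
> for line in sys.stdin:
>     # ... transform the row here; this placeholder just echoes it ...
>     sys.stdout.write(line)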
>
> Cheers,
> Ajo.
>
>
> On Tue, Mar 1, 2011 at 8:22 AM, Irfan Mohammed <irfan.ma@gmail.com> wrote:
>
>> Hi,
>>
>> I have a Hive script [given below] that calls a Python script via
>> transform. For large datasets [ > 100M rows ], the reducer is unable to
>> start the Python process, failing with the error "argument list too long".
>> The detailed error stack is given below.
>>
>> The Python script takes only one static argument, "hbase". For small
>> datasets [ 1M rows ], the script works fine.
>>
>>    1. Is this problem related to the number of open file handles on the
>>    reducer box? (see the sketch below)
>>    2. How do I get at the real, underlying error message?
>>    3. Is there a way to see the actual unix process, with its arguments,
>>    that Hive is instantiating?
>>
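>> For (1), a quick check I plan to run on the reducer box (a rough sketch;
>> SC_ARG_MAX is the kernel's combined argv+environment budget for exec,
>> RLIMIT_NOFILE the per-process open-file-handle limit):
>>
>> import os
>> import resource
>>
>> # Kernel limit on the combined size of argv + envp passed to exec.
>> print 'ARG_MAX =', os.sysconf('SC_ARG_MAX')
>>
>> # Per-process open file descriptor limits (soft, hard).
>> print 'RLIMIT_NOFILE =', resource.getrlimit(resource.RLIMIT_NOFILE)
>>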
>> Thanks,
>> Irfan
>>
>> >>>>>>>>>>>>>> script >>>>>>>>>>>>>>>>>>>>>>>>>>>
>>
>> true && echo "
>> set hive.exec.compress.output=true;
>> set io.seqfile.compression.type=BLOCK;
>> set mapred.output.compression.type=BLOCK;
>> set
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>
>> set hive.merge.mapfiles=false;
>> set hive.exec.dynamic.partition=true;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>> set hive.exec.max.dynamic.partitions=20000;
>>
>> set hive.map.aggr=true;
>> set mapred.reduce.tasks=200;
>>
>> add file /home/irfan/scripts/user_id_output.py;
>>
>> from (
>>   select
>>     user_id,
>>     source_id,
>>     load_id,
>>     user_id_json,
>>     pid
>>   from
>>     user_id_bucket
>>   where
>>         1 = 1
>>     and length(user_id) = 40
>>   distribute by pid
>>   sort by user_id, load_id desc
>> ) T1
>> insert overwrite table user_id_hbase_stg1 partition (pid)
>> SELECT
>>   transform
>>   (
>>       T1.user_id
>>     , T1.source_id
>>     , T1.load_id
>>     , T1.user_id_json
>>     , T1.pid
>>   )
>>   using 'python2.6 user_id_output.py hbase'
>>   as user_id, user_id_info1, user_id_info2, user_id_info3, user_id_info4,
>> user_id_info5, user_id_info6, pid
>> ;
>>
>> " > ${SQL_FILE};
>>
>> true && ${HHIVE} -f ${SQL_FILE} 1>${TXT_FILE} 2>${LOG_FILE};
>>
>> >>>>>>>>>>>>>> error log >>>>>>>>>>>>>>>>>>>>>>>>>
>>
>> 2011-03-01 14:46:13,705 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initializing Self 5 OP
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Operator 5 OP initialized
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initializing children of 5 OP
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing child 6 SEL
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing Self 6 SEL
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct<_col0:string,_col1:string,_col2:string,_col3:string,_col4:string>
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Operator 6 SEL initialized
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing children of 6 SEL
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initializing child 7 SCR
>> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initializing Self 7 SCR
>> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Operator 7 SCR initialized
>> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initializing children of 7 SCR
>> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initializing child 8 FS
>> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initializing Self 8 FS
>> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Operator 8 FS initialized
>> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initialization Done 8 FS
>> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initialization Done 7 SCR
>> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initialization Done 6 SEL
>> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initialization Done 5 OP
>> 2011-03-01 14:46:13,733 INFO ExecReducer: ExecReducer: processing 1 rows: used memory = 89690888
>> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 5 forwarding 1 rows
>> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
>> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Executing [/usr/bin/python2.6, user_id_output.py, hbase]
>> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: tablename=null
>> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: partname=null
>> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: alias=null
>> 2011-03-01 14:46:13,777 FATAL ExecReducer: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"AA11223344","reducesinkkey1":"20110210_02"},"value":{"_col0":"xxxxx","_col1":"m1","_col2":"20110210_02","_col3":"{'m07': 'x12', 'm02': 'x34', 'm01': 'm45'}","_col4":"0A9"},"alias":0}
>>   at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:253)
>>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:211)
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot initialize ScriptOperator
>>   at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:320)
>>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>>   at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>>   at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
>>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>>   at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:244)
>>   ... 7 more
>> Caused by: java.io.IOException: Cannot run program "/usr/bin/python2.6": java.io.IOException: error=7, Argument list too long
>>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>>   at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:279)
>>   ... 15 more
>> Caused by: java.io.IOException: java.io.IOException: error=7, Argument list too long
>>   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>>   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>>   ... 16 more
>>
>> 2011-03-01 14:46:13,779 WARN org.apache.hadoop.mapred.Child: Error running child
>> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"AA11223344","reducesinkkey1":"20110210_02"},"value":{"_col0":"xxxxx","_col1":"m1","_col2":"20110210_02","_col3":"{'m07': 'x12', 'm02': 'x34', 'm01': 'm45'}","_col4":"0A9"},"alias":0}
>>   at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:265)
>>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:211)
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"AA11223344","reducesinkkey1":"20110210_02"},"value":{"_col0":"xxxxx","_col1":"m1","_col2":"20110210_02","_col3":"{'m07': 'x12', 'm02': 'x34', 'm01': 'm45'}","_col4":"0A9"},"alias":0}
>>   at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:253)
>>   ... 7 more
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot initialize ScriptOperator
>>   at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:320)
>>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>>   at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>>   at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
>>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>>   at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:244)
>>   ... 7 more
>> Caused by: java.io.IOException: Cannot run program "/usr/bin/python2.6": java.io.IOException: error=7, Argument list too long
>>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>>   at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:279)
>>   ... 15 more
>> Caused by: java.io.IOException: java.io.IOException: error=7, Argument list too long
>>   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>>   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>>   ... 16 more
>> 2011-03-01 14:46:13,784 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>>
>>
>
