hive-user mailing list archives

From "Bogala, Chandra Reddy" <Chandra.Bog...@gs.com>
Subject RE: Hadoop streaming with insert dynamic partition generate many small files
Date Wed, 26 Feb 2014 15:41:18 GMT
Hi,
  I tried with an absolute path and it works fine. But I have another issue. I have the below data in
string format, and I am trying to convert/cast that data to another type using transform, but
the conversion is not done properly. What might be the issue?


10347   [{"protocol":"udp","sum_bytes":20897,"sum_packets":61,"sum_flows":35,"rank":1}, {"protocol":"tcp","sum_bytes":20469,"sum_packets":229,"sum_flows":10,"rank":2},
{"protocol":"icmp","sum_bytes":828,"sum_packets":13,"sum_flows":9,"rank":3}]


Transform Query:
from mytable2
select transform(x1, x2)
using '/bin/cat'
as (x1 int, x2 ARRAY<struct<protocol:string,sum_bytes:bigint,sum_packets:bigint,sum_flows:bigint,rank:int>>);

10347   [{"protocol":"[{\"protocol\":\"udp\",\"sum_bytes\":20897,\"sum_packets\":61,\"sum_flows\":35,\"rank\":1},
{\"protocol\":\"tcp\",\"sum_bytes\":20469,\"sum_packets\":229,\"sum_flows\":10,\"rank\":2},
{\"protocol\":\"icmp\",\"sum_bytes\":828,\"sum_packets\":13,\"sum_flows\":9,\"rank\":3}]","sum_bytes":null,"sum_packets":null,"sum_flows":null,"rank":null}]


From: Bogala, Chandra Reddy [Tech]
Sent: Monday, February 17, 2014 11:15 PM
To: 'user@hive.apache.org'
Subject: RE: Hadoop streaming with insert dynamic partition generate many small files

As I suspected, Hive is expecting java on the PATH, but in our clusters java is not on the PATH. So
I tried specifying the absolute path ('/xxx/yyy/jre/bin/java  -cp .:embeddedDoc.jar com.yy.xx.mapreduce.NestedDocReduce'),
but it throws a different exception (script error). Is it a bug? What is the reason the absolute
path is not accepted in the stream reduce below?

From: Bogala, Chandra Reddy [Tech]
Sent: Thursday, February 13, 2014 10:42 PM
To: 'user@hive.apache.org'
Subject: RE: Hadoop streaming with insert dynamic partition generate many small files

Thanks Wang. I have implemented the reducer in Java and am trying to run it with the below job, but it is
failing with "java.io.IOException: error=2, No such file or directory".
I am thinking it may be because it is not able to find the jar/java on the PATH. Right?

Hive Job:
add jar /home/xxxxx/embeddedDoc.jar;

from (from xxx_aggregation_struct_type as mytable
map mytable.tag, mytable.proto_agg
using '/bin/cat' as c1,c2
cluster by c1) mo
insert overwrite table mytable2
reduce mo.c1, mo.c2
using 'java -cp .:embeddedDoc.jar com.yy.xx.mapreduce.NestedDocReduce'
as x1, x2;



Exception:
------------------------------------
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:258)
        ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20000]: Unable to initialize
custom script.
        at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:367)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:249)
        ... 7 more
Caused by: java.io.IOException: Cannot run program "java": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:326)
        ... 15 more
Caused by: java.io.IOException: error=2, No such file or directory
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
        at java.lang.ProcessImpl.start(ProcessImpl.java:130)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
        ... 16 more


FAILED: Execution Error, return code 20000 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask.
Unable to initialize custom script.

Thanks,
Chandra

From: Chen Wang [mailto:chen.apache.solr@gmail.com]
Sent: Tuesday, February 04, 2014 3:00 AM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small files

Chandra,
You don't necessarily need Java to implement the mapper/reducer. Check out the answer in this
post:
http://stackoverflow.com/questions/6178614/custom-map-reduce-program-on-hive-whats-the-rulehow-about-input-and-output

Also, in my sample,
A.column1, A.column2 ==> mymapper ==> key, value, and mymapper simply reads from stdin
and converts to key,value.
Chen
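
(As a side note, a minimal hypothetical sketch of what such a streaming mapper could look like in Java, not Chen's actual code: Hive pipes each selected row to the script's stdin as tab-separated columns, and whatever the script writes to stdout is read back as the "key, value" columns of the MAP clause.)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Hypothetical streaming mapper: reads tab-separated input columns from stdin
    // and writes key<TAB>value lines to stdout. Names are illustrative only.
    public class MyMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t", -1);    // A.column1, A.column2, ...
                if (cols.length < 2) continue;           // skip malformed rows
                String key = cols[0];                    // key derived from the first column
                String value = cols[1];                  // value derived from the second column
                System.out.println(key + "\t" + value);  // becomes "AS key, value" in Hive
            }
        }
    }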

On Mon, Feb 3, 2014 at 5:51 AM, Bogala, Chandra Reddy <Chandra.Bogala@gs.com>
wrote:
Hi Wang,

    I am trying MAP & REDUCE inside a Hive query for the first time. Is it possible to share the mymapper
and myreducer code, so that I can understand how the columns (A.column1, A.... to key, value)
are converted? Also, can you point me to some documents to read more about it?
Thanks,
Chandra


From: Chen Wang [mailto:chen.apache.solr@gmail.com]
Sent: Monday, February 03, 2014 12:26 PM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small files

It seems that hive.exec.reducers.bytes.per.reducer was still not big enough: I added another
0, and now I get only one file under each partition.

On Sun, Feb 2, 2014 at 10:14 PM, Chen Wang <chen.apache.solr@gmail.com>
wrote:
Hi,
I am using a Java reducer to read from a table and then write to another one (a sketch of such a reducer appears after the query below):

  FROM (
      FROM (
          SELECT column1,...
          FROM table1
          WHERE ( partition > 6 and partition < 12 )
      ) A
      MAP A.column1,A....
      USING 'java -cp .my.jar mymapper.mymapper'
      AS key, value
      CLUSTER BY key
  ) map_output
  INSERT OVERWRITE TABLE target_table PARTITION(partition)
  REDUCE
      map_output.key,
      map_output.value
  USING 'java -cp .:myjar.jar myreducer.myreducer'
  AS column1, column2;
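
(As a hypothetical illustration, not the actual myreducer.myreducer: a reducer launched this way reads the clustered key<TAB>value lines from stdin. CLUSTER BY only guarantees that rows with the same key arrive consecutively at the same reducer, so the script has to detect key boundaries itself.)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Hypothetical streaming reducer: input rows arrive clustered by key, so a change
    // in the key column marks the end of a group. Emits one key<TAB>aggregate line per
    // group, which Hive reads back AS column1, column2. Names are illustrative only.
    public class MyReducer {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8));
            String currentKey = null;
            StringBuilder combined = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t", -1);
                String key = cols[0];
                String value = cols.length > 1 ? cols[1] : "";
                if (currentKey != null && !currentKey.equals(key)) {
                    System.out.println(currentKey + "\t" + combined);   // close the previous group
                    combined.setLength(0);
                }
                currentKey = key;
                if (combined.length() > 0) combined.append(',');
                combined.append(value);                                  // toy aggregation: concatenate values
            }
            if (currentKey != null) {
                System.out.println(currentKey + "\t" + combined);        // flush the last group
            }
        }
    }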

It's all working fine, except that many (20-30) small files are generated under each
partition. I am setting SET hive.exec.reducers.bytes.per.reducer=1280000000; hoping to
get one big enough file for each partition, but it does not seem to have any effect.
I still get 20-30 small files under each folder, and each file is around 7 KB.

How can I force it to generate only one big file per partition? Does this have anything to
do with the streaming? I recall that in the past, when I read directly from a table with a UDF
and wrote to another table, it generated only one big file for the target partition. Not sure
why that is.



Any help appreciated.

Thanks,

Chen






