hive-user mailing list archives

From Ayon Sinha <ayonsi...@yahoo.com>
Subject Re: Hive query taking too much time
Date Wed, 07 Dec 2011 06:36:02 GMT
How about a simple Pig script with a load and a store statement? Set the maximum number of
reducers to, say, 20 or 30; that way you will have only 20-30 output files. Then put these
files in the Hive directory. Make sure the delimiters match between Hive and Pig.
 
-Ayon



________________________________
 From: Vikas Srivastava <vikas.srivastava@one97.net>
To: user@hive.apache.org 
Sent: Tuesday, December 6, 2011 10:00 PM
Subject: Re: Hive query taking too much time
 

Hey, if all the files have the same columns, you can easily merge them with a shell script:

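table=yourtable
# Concatenate every CSV file in the current directory into one file.
# The output name does not end in .csv, so a re-run never feeds it back into the loop.
for file in *.csv
do
    cat "$file" >> new_file.txt
done
# Load the merged file into the table.
hive -e "load data local inpath 'new_file.txt' into table $table"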

This merges all the files into a single file, which the last command then loads into the table.
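
If one merged file is not what you want (Paul suggests 24 or 48 files further down the thread), the same loop can spread the input round-robin over a fixed number of output files. This is only a minimal sketch: merge_csv, its directory arguments and the part-N.csv naming are made up for illustration, and it assumes every input file ends with a newline and has no header row.

```shell
#!/bin/sh
# Sketch: merge many small CSV files into a fixed number of larger files.
# merge_csv, the directory layout and the part-N.csv names are illustrative assumptions.
merge_csv() {
    src_dir=$1      # directory holding the small *.csv files
    out_dir=$2      # directory that receives the merged files (should start empty)
    buckets=$3      # number of merged files to produce, e.g. 24
    mkdir -p "$out_dir"
    i=0
    for file in "$src_dir"/*.csv
    do
        [ -f "$file" ] || continue      # skip the literal pattern if nothing matched
        n=$((i % buckets))              # round-robin: pick the output file for this input
        cat "$file" >> "$out_dir/part-$n.csv"
        i=$((i + 1))
    done
}

# Example usage (paths and table name are placeholders):
#   merge_csv ./small_files ./merged 24
#   hive -e "load data local inpath './merged' into table yourtable"
```

The usage lines point LOAD DATA at the output directory so that all merged files are loaded in one statement. Check that behaviour on your Hive version before relying on it.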


On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta <success.mohit.gupta@gmail.com> wrote:

>Hi Paul,
>I am having the same problem. Do you know of an efficient way to merge the files?
>
>
>-Mohit
>
>
>
>On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmackles@adobe.com> wrote:
>
>>How much time is it spending in the map and reduce phases, respectively? The large number
>>of files could be creating a lot of mappers, which adds a lot of overhead. What happens if
>>you merge the 2624 files into a smaller number, like 24 or 48? That should speed up the mapper
>>phase significantly.
>> 
>>From: Savant, Keshav [mailto:Keshav.C.Savant@fisglobal.com]
>>Sent: Tuesday, December 06, 2011 6:01 AM
>>To: user@hive.apache.org
>>Subject: Hive query taking too much time
>> 
>>Hi All,
>> 
>>My setup is 
>>hadoop-0.20.203.0
>>hive-0.7.1
>> 
>>I have a 5-node cluster in total: 4 data nodes and 1 namenode (which also acts
>>as the secondary namenode). On the namenode I have set up Hive with HiveDerbyServerMode to
>>support multiple Hive server connections.
>> 
>>I have loaded plain-text CSV files into HDFS using ‘LOAD DATA’ Hive query statements. The
>>total number of files is 2624 and their combined size is only 713 MB, which is very little from
>>a Hadoop perspective, since Hadoop can handle TBs of data very easily.
>> 
>>The problem is that when I run a simple count query (i.e. select count(*) from a_table),
>>it takes too long to execute.
>> 
>>For instance, it takes almost 17 minutes to execute this query when the table has
>>950,000 rows. I understand that this is far too long for a query over such a small
>>amount of data.
>>This is only a dev environment; in the production environment the number of files and
>>their combined size will run into the millions and GBs respectively.
>> 
>>On analyzing the logs on all the datanodes and the namenode/secondary namenode, I do not
>>find any errors in them.
>> 
>>I have also tried setting mapred.reduce.tasks to a fixed number, but the number of reducers
>>always remains 1, while the number of maps is determined by Hive alone.
>> 
>>Any suggestions on what I am doing wrong, or how I can improve the performance of Hive
>>queries? Any suggestion or pointer is highly appreciated.
>> 
>>Keshav
>
>
>
>-- 
>Best Regards,
>
>Mohit Gupta
>Software Engineer at Vdopia Inc.
>
>
>


-- 
With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !