hive-user mailing list archives

From Vikas Srivastava <vikas.srivast...@one97.net>
Subject Re: Hive query taking too much time
Date Wed, 07 Dec 2011 06:00:56 GMT
Hey, if all the files have the same columns, you can easily merge them with a
shell script:

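# Merge every CSV in the current directory into one file, then load it into Hive.
table=yourtable
rm -f new_file.csv   # start clean so a rerun does not append the old merge to itself
for file in *.csv
do
    cat "$file" >> new_file.csv
done
# Load the merged file (not the last input file) into the table.
hive -e "load data local inpath 'new_file.csv' into table $table"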

This merges all the files into a single file, which you can then load with
the same hive command.
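
If you would rather end up with a smaller number of files (like the 24 or 48 Paul mentions) instead of one, the same idea can be sketched as below. This is only a rough sketch: it assumes the CSVs have no header rows and each ends with a newline, and the function name `merge_csvs`, the `part_N.csv` output names, and the directory arguments are made up for illustration.

```shell
#!/bin/sh
# merge_csvs SRC_DIR OUT_DIR N
# Concatenate SRC_DIR/*.csv into at most N files in OUT_DIR, round-robin.
# Hypothetical helper: names and layout are illustrative only.
merge_csvs() {
    src_dir=$1
    out_dir=$2
    buckets=$3
    mkdir -p "$out_dir"
    i=0
    for file in "$src_dir"/*.csv; do
        [ -f "$file" ] || continue          # skip if the glob matched nothing
        bucket=$(( i % buckets ))           # pick the next output file in turn
        cat "$file" >> "$out_dir/part_$bucket.csv"
        i=$(( i + 1 ))
    done
}
```

Something like `merge_csvs ./csv_in ./merged 24` would then leave up to 24 files in `./merged`, each of which can be loaded with the same `load data local inpath` statement. Use an empty output directory, since the function appends and a rerun would duplicate rows.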

On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
<success.mohit.gupta@gmail.com> wrote:

> Hi Paul,
> I am having the same problem. Do you know any efficient way of merging the
> files?
>
> -Mohit
>
>
> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmackles@adobe.com> wrote:
>
>> How much time is it spending in the map/reduce phases, respectively? The
>> large number of files could be creating a lot of mappers which create a lot
>> of overhead. What happens if you merge the 2624 files into a smaller number
>> like 24 or 48? That should speed up the mapper phase significantly.
>>
>> From: Savant, Keshav [mailto:Keshav.C.Savant@fisglobal.com]
>> Sent: Tuesday, December 06, 2011 6:01 AM
>> To: user@hive.apache.org
>> Subject: Hive query taking too much time
>>
>> Hi All,
>>
>> My setup is:
>>
>> hadoop-0.20.203.0
>> hive-0.7.1
>>
>> I have a 5-node cluster: 4 data nodes and 1 namenode (which is also
>> acting as secondary namenode). On the namenode I have set up Hive with
>> HiveDerbyServerMode to support multiple Hive server connections.
>>
>> I have inserted plain text CSV files in HDFS using ‘LOAD DATA’ hive query
>> statements. The total number of files is 2624 and their combined size is
>> only 713 MB, which is very little from a Hadoop perspective, since Hadoop
>> can handle TBs of data very easily.
>>
>> The problem is, when I run a simple count query (i.e. select count(*)
>> from a_table), it takes too much time in executing the query.
>>
>> For instance, it takes almost 17 minutes to execute the said query when
>> the table has 950,000 rows. I understand that this is far too long for a
>> query over such a small amount of data.
>>
>> This is only a dev environment; in the production environment the number
>> of files and their combined size will move into millions and GBs
>> respectively.
>>
>> On analyzing the logs on all the datanodes and the namenode/secondary
>> namenode, I do not find any errors in them.
>>
>> I have also tried setting mapred.reduce.tasks to a fixed number, but the
>> number of reducers always remains 1, while the number of maps is
>> determined by hive only.
>>
>> Any suggestion as to what I am doing wrong, or how I can improve the
>> performance of hive queries? Any suggestion or pointer is highly
>> appreciated.
>>
>> Keshav
>>
>> _____________
>> The information contained in this message is proprietary and/or
>> confidential. If you are not the intended recipient, please: (i) delete the
>> message and all copies; (ii) do not disclose, distribute or use the message
>> in any manner; and (iii) notify the sender immediately. In addition, please
>> be aware that any message addressed to our domain is subject to archiving
>> and review by persons other than the intended recipient. Thank you.
>>
>
>
>
> --
> Best Regards,
>
> Mohit Gupta
> Software Engineer at Vdopia Inc.
>
>
>


-- 
With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !
