hive-user mailing list archives

From Wojciech Langiewicz <wlangiew...@gmail.com>
Subject Re: Hive query taking too much time
Date Wed, 07 Dec 2011 14:45:09 GMT
Hi,
In this case it's much easier and faster to merge all the files with these
commands:

cat *.csv > output.csv
hive -e "load data local inpath 'output.csv' into table $table"
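
A slightly more defensive variant of the same idea, as a sketch only. It removes any stale output before the glob is expanded, so a rerun cannot feed old merged data back into `cat`, and it reports failure if nothing was written. The function name `merge_csv`, the paths, and the table name `yourtable` are placeholders, not part of the original setup. It also assumes the CSV files have no header rows and that each file ends with a newline, because `cat` copies the bytes through unchanged.

```shell
#!/bin/sh
# merge_csv SRC_DIR OUT_FILE  (hypothetical helper)
# Concatenate every *.csv in SRC_DIR into OUT_FILE.
merge_csv() {
    src_dir=$1
    out_file=$2
    rm -f "$out_file" || return 1        # drop output left by an earlier run
    for f in "$src_dir"/*.csv; do        # glob is expanded after the rm above
        [ -f "$f" ] || continue          # glob matched nothing
        cat "$f" >> "$out_file" || return 1
    done
    [ -s "$out_file" ]                   # succeed only if something was written
}

# Usage (placeholder paths and table name):
#   merge_csv ./csv /tmp/output.csv &&
#       hive -e "load data local inpath '/tmp/output.csv' into table yourtable"
```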

On 07.12.2011 07:00, Vikas Srivastava wrote:
> Hey, if all the files have the same columns, you can easily merge them with
> a shell script:
>
> table=yourtable                # name of the target Hive table
> rm -f new_file.csv             # drop any merged file left by an earlier run
> for file in *.csv
> do
>     cat "$file" >> new_file.csv
> done
> hive -e "load data local inpath 'new_file.csv' into table $table"
>
> This merges all the files into a single file, which you can then load with
> a single query.
>
> On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
> <success.mohit.gupta@gmail.com> wrote:
>
>> Hi Paul,
>> I am having the same problem. Do you know any efficient way of merging the
>> files?
>>
>> -Mohit
>>
>>
>> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmackles@adobe.com> wrote:
>>
>>> How much time is it spending in the map and reduce phases, respectively? The
>>> large number of files could be creating a lot of mappers, which adds a lot
>>> of overhead. What happens if you merge the 2624 files into a smaller number,
>>> like 24 or 48? That should speed up the map phase significantly.
>>>
>>> From: Savant, Keshav [mailto:Keshav.C.Savant@fisglobal.com]
>>> Sent: Tuesday, December 06, 2011 6:01 AM
>>> To: user@hive.apache.org
>>> Subject: Hive query taking too much time
>>>
>>> Hi All,
>>>
>>> My setup is:
>>>
>>> hadoop-0.20.203.0
>>> hive-0.7.1
>>>
>>> I have a 5-node cluster: 4 data nodes and 1 namenode (which also acts as
>>> the secondary namenode). On the namenode I have set up Hive with
>>> HiveDerbyServerMode to support multiple Hive server connections.
>>>
>>> I have loaded plain-text CSV files into HDFS using Hive ‘LOAD DATA’
>>> statements. The total number of files is 2624 and their combined size is
>>> only 713 MB, which is very small from a Hadoop perspective, since Hadoop
>>> can handle TBs of data very easily.
>>>
>>> The problem is that when I run a simple count query (i.e. select count(*)
>>> from a_table), it takes too much time to execute.
>>>
>>> For instance, the query takes almost 17 minutes when the table has
>>> 950,000 rows. I understand that this is far too long for a query over
>>> such a small amount of data.
>>>
>>> This is only a dev environment; in production the number of files and
>>> their combined size will run into the millions and GBs, respectively.
>>>
>>> On analyzing the logs on all the datanodes and on the namenode/secondary
>>> namenode, I did not find any errors.
>>>
>>> I have also tried setting mapred.reduce.tasks to a fixed number, but the
>>> number of reducers always remains 1, while the number of mappers is
>>> determined by Hive alone.
>>>
>>> Any suggestions as to what I am doing wrong, or how I can improve the
>>> performance of Hive queries? Any suggestion or pointer is highly
>>> appreciated.
>>>
>>> Keshav
>>>
>>
>>
>>
>> --
>> Best Regards,
>>
>> Mohit Gupta
>> Software Engineer at Vdopia Inc.
>>
>>
>>
>
>
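
Paul's suggestion quoted above, merging the 2624 files into a smaller number such as 24 or 48 rather than into one, could be sketched roughly as follows. The function name `merge_into_parts`, the `part-N.csv` naming, the paths, and the table name `yourtable` are all placeholders. The files are spread round-robin by count, not by size, so the merged files come out evenly sized only if the inputs are similar in size; with 713 MB over 24 files that would be about 30 MB each. As with plain `cat`, it assumes no header rows and a trailing newline in every file.

```shell
#!/bin/sh
# merge_into_parts SRC_DIR OUT_DIR N  (hypothetical helper)
# Spread the *.csv files in SRC_DIR round-robin over N merged files
# named OUT_DIR/part-0.csv ... OUT_DIR/part-(N-1).csv.
# OUT_DIR should be a separate directory from SRC_DIR.
merge_into_parts() {
    src_dir=$1
    out_dir=$2
    parts=$3
    mkdir -p "$out_dir" || return 1
    rm -f "$out_dir"/part-*.csv          # drop output from an earlier run
    i=0
    for f in "$src_dir"/*.csv; do
        [ -f "$f" ] || continue          # glob matched nothing
        cat "$f" >> "$out_dir/part-$((i % parts)).csv" || return 1
        i=$((i + 1))
    done
}

# Usage (placeholder paths and table name):
#   merge_into_parts ./csv /tmp/merged 24 &&
#       for p in /tmp/merged/part-*.csv; do
#           hive -e "load data local inpath '$p' into table yourtable"
#       done
```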

