hive-user mailing list archives

From Wojciech Langiewicz <wlangiew...@gmail.com>
Subject Re: Hive query taking too much time
Date Thu, 08 Dec 2011 10:30:48 GMT
Using CombineFileInputFormat might help, but keeping many small files 
in HDFS still creates overhead.
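In Hive, combining small files into fewer map splits is usually enabled through the input format setting (this comes from the HIVE-74 work mentioned below; verify it is available on your version). A minimal sketch, with `a_table` standing in for your table:

```sql
-- Pack many small files into fewer map splits, so each mapper
-- processes several files instead of one tiny file each.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
select count(*) from a_table;
```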

I don't know the details of your requirements, but option 2 seems 
better; make sure that X is at least a few HDFS block sizes.
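A sketch of that size-based flush in shell, assuming a local spool directory where files arrive before the Hive load; the threshold, paths, and table name are illustrative assumptions, tune them for your cluster:

```shell
#!/bin/sh
# Option 2 sketch: flush the merged file once it crosses a byte threshold.
THRESHOLD=134217728             # assumed: 2 x 64 MB HDFS blocks
mkdir -p spool
printf 'row1\n' > spool/a.csv   # stand-ins for incoming files
printf 'row2\n' > spool/b.csv

# Append everything in the spool to the running merged file.
cat spool/*.csv >> merged.csv && rm spool/*.csv

# wc -c prints the file size in bytes.
size=$(wc -c < merged.csv)
if [ "$size" -ge "$THRESHOLD" ]; then
    # Large enough: load into Hive, then start a fresh merged file.
    # hive -e "load data local inpath 'merged.csv' into table mytable"
    : > merged.csv
fi
```

Run from cron, this keeps every loaded file at roughly the same size regardless of how many small files arrived.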

You could also merge files incrementally, like first every 1h, then 
merge those results again after 12h and so on.
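A minimal sketch of that incremental scheme, assuming files land in a local staging directory before the Hive load; the directory names and the hourly/12-hour split are illustrative:

```shell
#!/bin/sh
# Incremental merge sketch: small CSVs land in incoming/, an hourly
# job folds them into one batch file, a 12-hour job folds the batches.
mkdir -p incoming hourly halfday
printf 'row1\n' > incoming/a.csv   # stand-ins for arriving files
printf 'row2\n' > incoming/b.csv

# Hourly cron step: concatenate everything that has arrived so far.
cat incoming/*.csv > hourly/batch1.csv && rm incoming/*.csv

# 12-hour cron step: merge the hourly batches into one large file.
cat hourly/*.csv > halfday/merged.csv && rm hourly/*.csv

# Final step (shown, not executed here): load the large file into Hive.
# hive -e "load data local inpath 'halfday/merged.csv' into table mytable"
```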

You can use the -getmerge option, or this class (I have not used it):
http://hadoop.apache.org/hdfs/docs/r0.21.0/api/org/apache/hadoop/hdfs/tools/HDFSConcat.html
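For the -getmerge route: the command concatenates every file under an HDFS directory into a single local file, which can then be reloaded as one file. The paths below are examples only; these commands need a running cluster:

```shell
# Merge all files under the table's HDFS directory into one local file,
# then reload it, replacing the many small files (example paths):
hadoop fs -getmerge /user/hive/warehouse/a_table merged_output.csv
hive -e "load data local inpath 'merged_output.csv' overwrite into table a_table"
```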


On 08.12.2011 09:03, Aniket Mokashi wrote:
> You can also take a look at--
> https://issues.apache.org/jira/browse/HIVE-74
>
> On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav<
> Keshav.C.Savant@fisglobal.com>  wrote:
>
>> You are right Wojciech Langiewicz, we did the same thing and posted the
>> results yesterday. Now we are planning to do this with a shell script,
>> because of the dynamic nature of our environment, where files keep
>> arriving. We will schedule the shell script as a cron job.
>>
>> A question on this: we are planning to merge files based on one of the
>> following approaches.
>> 1. Based on file count: if the file count reaches X files, then
>> merge and insert into HDFS.
>> 2. Based on merged file size: if the merged file size crosses X
>> bytes, then insert into HDFS.
>>
>> I think option 2 is better because that way all merged files will be
>> of almost the same size. What do you suggest?
>>
>> Kind Regards,
>> Keshav C Savant
>>
>>
>> -----Original Message-----
>> From: Wojciech Langiewicz [mailto:wlangiewicz@gmail.com]
>> Sent: Wednesday, December 07, 2011 8:15 PM
>> To: user@hive.apache.org
>> Subject: Re: Hive query taking too much time
>>
>> Hi,
>> In this case it's much easier and faster to merge all files using this
>> command:
>>
>> cat *.csv > output.csv
>> hive -e "load data local inpath 'output.csv' into table $table"
>>
>> On 07.12.2011 07:00, Vikas Srivastava wrote:
>>> hey, if all the files have the same columns then you can easily
>>> merge them with a shell script:
>>>
>>> table=yourtable
>>> for file in *.csv
>>> do
>>>   cat "$file" >> new_file.csv
>>> done
>>> hive -e "load data local inpath 'new_file.csv' into table $table"
>>>
>>> it will merge all the files into a single file, then you can upload
>>> it with the same query
>>>
>>> On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
>>> <success.mohit.gupta@gmail.com>wrote:
>>>
>>>> Hi Paul,
>>>> I am having the same problem. Do you know any efficient way of
>>>> merging the files?
>>>>
>>>> -Mohit
>>>>
>>>>
>>>> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles<pmackles@adobe.com>
>> wrote:
>>>>
>>>>> How much time is it spending in the map/reduce phases, respectively?
>>>>> The large number of files could be creating a lot of mappers, which
>>>>> adds a lot of overhead. What happens if you merge the 2624 files into
>>>>> a smaller number, like 24 or 48? That should speed up the mapper
>>>>> phase significantly.
>>>>>
>>>>> *From:* Savant, Keshav [mailto:Keshav.C.Savant@fisglobal.com]
>>>>> *Sent:* Tuesday, December 06, 2011 6:01 AM
>>>>> *To:* user@hive.apache.org
>>>>> *Subject:* Hive query taking too much time
>>>>>
>>>>> Hi All,
>>>>>
>>>>> My setup is:
>>>>>
>>>>> hadoop-0.20.203.0
>>>>> hive-0.7.1
>>>>>
>>>>> I have a total of 5 nodes in the cluster: 4 data nodes and 1 namenode
>>>>> (it also acts as the secondary namenode). On the namenode I have set
>>>>> up Hive with HiveDerbyServerMode to support multiple Hive server
>>>>> connections.
>>>>>
>>>>> I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive
>>>>> statements. The total number of files is 2624 and their combined size
>>>>> is only 713 MB, which is very small from a Hadoop perspective; Hadoop
>>>>> handles TBs of data easily.
>>>>>
>>>>>
>>>>> The problem is that when I run a simple count query (i.e. *select
>>>>> count(*) from a_table*), it takes far too much time to execute.
>>>>>
>>>>>
>>>>> For instance, it takes almost 17 minutes to execute the said query
>>>>> when the table has 950,000 rows; that is too long for such a small
>>>>> amount of data.
>>>>>
>>>>> This is only a dev environment; in the production environment the
>>>>> number of files and their combined size will run into millions and
>>>>> GBs respectively.
>>>>>
>>>>>
>>>>> On analyzing the logs on all the datanodes and the namenode/secondary
>>>>> namenode, I do not find any errors in them.
>>>>>
>>>>>
>>>>> I have tried setting mapred.reduce.tasks to a fixed number, but the
>>>>> number of reducers always remains 1, while the number of maps is
>>>>> determined by Hive.
>>>>>
>>>>>
>>>>> Any suggestions on what I am doing wrong, or on how I can improve the
>>>>> performance of Hive queries? Any suggestion or pointer is highly
>>>>> appreciated.
>>>>>
>>>>> Keshav
>>>>>
>>>>> _____________
>>>>> The information contained in this message is proprietary and/or
>>>>> confidential. If you are not the intended recipient, please: (i)
>>>>> delete the message and all copies; (ii) do not disclose, distribute
>>>>> or use the message in any manner; and (iii) notify the sender
>>>>> immediately. In addition, please be aware that any message addressed
>>>>> to our domain is subject to archiving and review by persons other
>>>>> than the intended recipient. Thank you.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>>
>>>> Mohit Gupta
>>>> Software Engineer at Vdopia Inc.
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>

