hive-user mailing list archives

From: "Connell, Chuck" <Chuck.Conn...@nuance.com>
Subject: RE: Performance Issues in Hive with S3 and Partitions
Date: Fri, 27 Jul 2012 19:39:37 GMT

What about making your small files bigger, by ZIPping them together? Of course, you have to
think about this carefully, so MapReduce can efficiently retrieve the files it needs without
unzipping everything every time.
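
One way to get fewer, larger files without ZIP archives is to let Hive rewrite the data itself. The following is only a sketch of that alternative, not the ZIP approach suggested above: the table `logs_compacted`, its column types, and the size thresholds are assumptions for illustration, and the `hive.merge.*` settings should be checked against the Hive version in use.

```sql
-- Ask Hive to merge small output files at the end of the job.
set hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;              -- merge outputs of map-reduce jobs
set hive.merge.smallfiles.avgsize=134217728;  -- merge when the average output file is under 128 MB (illustrative)
set hive.merge.size.per.task=268435456;       -- aim for merged files of about 256 MB (illustrative)

-- Hypothetical target table. Only the columns id and ts are known from
-- this thread, and STRING is an assumed type for both.
CREATE TABLE IF NOT EXISTS logs_compacted (id STRING, ts STRING);

-- Rewrite the small-file table into the compacted one.
INSERT OVERWRITE TABLE logs_compacted
SELECT id, ts FROM logs;
```

Queries would then read `logs_compacted` instead of `logs`. The rewrite would have to be repeated, or done per partition, as new small files arrive.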

Chuck


From: richin.jain@nokia.com [mailto:richin.jain@nokia.com]
Sent: Friday, July 27, 2012 2:42 PM
To: user@hive.apache.org
Subject: RE: Performance Issues in Hive with S3 and Partitions

Igor,

I did not see any major improvement in performance even after setting "hive.optimize.s3.query=true",
although the AWS team suggested the same setting.

My problem is that I have too many small files: 3 levels of partitions, 6500+ files, and each
file is < 1 MB.
I know Hadoop and HDFS are not meant to deal with a lot of small files, but if that is the
way to go, is there any workaround?

Thanks,
Richin

From: Jain Richin (Nokia-LC/Boston)
Sent: Tuesday, July 24, 2012 11:49 AM
To: user@hive.apache.org
Subject: RE: Performance Issues in Hive with S3 and Partitions

Hi Igor,

Thanks for the response. Yes I am using EMR.
I will make changes and let you know if that helps.

Richin

From: ext Igor Tatarinov [mailto:igor@decide.com]
Sent: Tuesday, July 24, 2012 12:38 AM
To: user@hive.apache.org
Subject: Re: Performance Issues in Hive with S3 and Partitions

Are you using EMR?
Have you tried setting
hive.optimize.s3.query=true

as mentioned in
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html

I haven't tried using that option myself. I am curious if it helps in your scenario. The above
page also mentions another fix that's supposed to help with partitioned tables. Optimizing
queries with thousands of input files used to take a lot of time. But it looks like that fix
is enabled by default now.

Just in case, also check your JVM reuse option. If it's too low, performance will suffer.
I had it set to 3 to avoid running out of memory. Using the default value of 20 really helps
when reading lots of small files.
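
Both settings discussed here can be applied per Hive session. A minimal sketch, assuming an EMR Hive build that supports `hive.optimize.s3.query` and a Hadoop 1.x-style property name for JVM reuse (`mapred.job.reuse.jvm.num.tasks`, which is an assumption and is not named in this thread):

```sql
-- EMR-specific switch described on the page linked above.
set hive.optimize.s3.query=true;

-- Let each task JVM be reused for up to 20 tasks before it is torn down.
-- Assumed Hadoop 1.x property name; newer Hadoop releases name it differently.
set mapred.job.reuse.jvm.num.tasks=20;
```

A lower reuse value trades speed for memory headroom, which matches the reason given above for having it set to 3.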

igor
decide.com

On Mon, Jul 23, 2012 at 8:33 PM, <richin.jain@nokia.com> wrote:
Hi,

Sorry, this is an AWS-specific Hive question. I have two external Hive tables for my custom
logs:

1. A flat directory structure on AWS S3, with no partitions and files in bz2-compressed format
(a few big files)

2. Three levels of partitions on AWS S3 (a lot of small uncompressed files)

I noticed that my queries on the partitioned table take forever to run. The same queries
run fine and finish quickly on the table with no partitions.
Am I missing something? I suspect this has something to do with the way S3 behaves.

An example query is:

select id,
       (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm"))
        - min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
from logs
group by id;

Thanks,
Richin

