hive-user mailing list archives

From "Connell, Chuck" <Chuck.Conn...@nuance.com>
Subject RE: Hive File Sizes, Merging, and Splits
Date Tue, 25 Sep 2012 19:35:49 GMT
Why do you think the current generated code is inefficient?



From: John Omernik [mailto:john@omernik.com]
Sent: Tuesday, September 25, 2012 2:57 PM
To: user@hive.apache.org
Subject: Hive File Sizes, Merging, and Splits

I am really struggling to make heads or tails of how to optimize the data in my
tables for best query times.  I have a partition that is compressed (Gzip) RCFile data in
two files

total 421877
263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 000000_0
158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 000001_0



No matter what I set my split settings to prior to the job, I always get three mappers.  My
block size is 268435456, but the setting doesn't seem to change anything. I can set the split
size huge or small with no apparent effect.


I know there are many esoteric items here, but is there any good documentation on setting
these things to make my queries on this data more efficient?  I am not sure why it needs three
map tasks on this data; it should really just grab two mappers. Not to mention, I thought
gzip wasn't splittable anyhow.  So, from that standpoint, how does it even send data to three
mappers?  If you know of some secret cache of documentation for Hive, I'd love to read it.

Thanks

