hive-user mailing list archives

From Edward Capriolo <>
Subject Re: question about number of map tasks for small file
Date Wed, 01 Jun 2011 19:37:47 GMT
On Wed, Jun 1, 2011 at 1:12 PM, Igor Tatarinov <> wrote:

> Can you pre-aggregate your historical data to reduce the number of files?
> We used to partition our data by date but that created too many output
> files so now we partition by month.
> I do find it odd that Hive (0.6) can't merge compressed output files. We
> could have gotten away with daily partitioning if Hive could merge small
> files. I tried disabling compression, but it actually caused some execution
> problems (perhaps xcievers-related, I am not sure).
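The monthly pre-aggregation Igor describes can be sketched in HiveQL. This is only an illustration: the table and column names (daily_logs, monthly_logs, host, request, status, dt, month) are hypothetical, not from the thread.

```sql
-- Roll one month of daily-partitioned rows into a single monthly partition,
-- cutting the output file count roughly 30x. All names here are hypothetical.
INSERT OVERWRITE TABLE monthly_logs PARTITION (month='2011-05')
SELECT host, request, status
FROM daily_logs
WHERE dt >= '2011-05-01' AND dt <= '2011-05-31';
```

Run once per month (e.g. from cron); queries over history then hit ~1/30th as many partitions and files.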
> On Wed, Jun 1, 2011 at 12:38 AM, Junxian Yan <>wrote:
>> Today I tried CombineHiveInputFormat and set the max split size for the
>> Hadoop input. It seems I can get the expected number of map tasks, but
>> another problem is that the CPU is heavily consumed by the map tasks,
>> almost 100%.
>> I just ran a query with a simple WHERE condition over test files whose
>> total size is about 30M, spread across about 10 thousand small files. The
>> execution time is over 700s. It's killing us. Because the files are
>> generated by Flume, all of them are seq files.
>> R
>> On Tue, May 31, 2011 at 2:55 AM, Junxian Yan <>wrote:
>>> Hi Guys
>>> I use Flume to store log files and use Hive to query them.
>>> Flume always stores small files with the suffix .seq. Now I have over 35
>>> thousand seq files. Every time I launch a query script, 35 thousand map
>>> tasks are created and it takes a very long time to complete.
>>> I also tried to set CombineHiveInputFormat, but with this option the task
>>> seems to execute slowly, because the total size of the data folder is over
>>> 700M. Now in my testing env, I only have 3 data nodes. I also tried to add
>>> after the CombineHiveInputFormat setting, but it seems it doesn't work.
>>> There is always only one map task if I set CombineHiveInputFormat.
>>> Can you please show me a solution in which I can set the map task number
>>> freely?
>>> BTW: the Hadoop version is 0.20 and Hive is 0.5
>>> Richard
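The split-combining behavior discussed above is controlled by a handful of properties. A sketch follows; the property names are from the Hadoop 0.20-era CombineFileInputFormat and the byte values are illustrative, so verify both against your Hive/Hadoop versions:

```sql
-- Combine many small seq files into fewer, larger splits.
-- Values are illustrative; verify names against your Hive/Hadoop version.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=100000000;          -- ~100 MB cap per combined split
set mapred.min.split.size.per.node=50000000;  -- combine files node-local first
set mapred.min.split.size.per.rack=50000000;  -- then within the rack
```

With roughly 700M of input, a ~100 MB cap would yield on the order of 7 map tasks; with no cap set, the combiner is free to collapse nearly everything into very few splits, which matches the single-map-task symptom described above.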
We have open sourced our filecrusher/optimizer; your post reminded me to
throw our new V2 version over the open source fence.

I know many are looking for an in-Hive solution, but the file crusher does
the job for us.

