hive-user mailing list archives

From Nishant Kelkar <nishant....@gmail.com>
Subject Re: Optimising Map and Reduce numbers for Compressed Dataset
Date Wed, 13 Aug 2014 16:26:14 GMT
Maybe try this at the Hive terminal:

SET mapreduce.input.fileinputformat.split.maxsize=your_split_size;

Where "your_split_size" = SUM(all small file sizes) / #mappers you'd like

Thanks and Regards,
Nishant


On Wed, Aug 13, 2014 at 4:00 AM, Ana Gillan <ana.gillan@gmail.com> wrote:

> Hi,
>
> I am currently experimenting with using Hive for a large dataset and I am
> having trouble with optimising the jobs.
>
> The dataset I have consists of quite a large number of fairly small gzipped
> files, but I am carrying out a series of transformations on it, which means
> that the size of the data being processed by the mappers and reducers is
> significantly larger than the input data. As such, only a very small number
> of mappers and reducers are launched, and it takes a very long time to
> finish any job.
>
> Hive is using CombineFileInputFormat and I have also set
> hive.hadoop.supports.splittable.combineinputformat to true because the
> files are compressed. What other settings should I be looking at? Is there
> a way to specify the number of map tasks?
>
> Thanks,
> Ana
>
