hive-user mailing list archives

From Ana Gillan <>
Subject Re: Optimising Map and Reduce numbers for Compressed Dataset
Date Wed, 13 Aug 2014 19:36:02 GMT
Thanks so much, Nishant!

I set the mapreduce.input.fileinputformat.split.maxsize as you suggested and
the queries spawned more mappers, which made the job finish significantly
quicker! (For anyone looking at this, this setting needs to be in bytes).
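For example, plugging made-up numbers into Nishant's rule of thumb below
(SUM of all small file sizes divided by the number of mappers you'd like,
say roughly 50 GB of input and a target of 200 mappers - purely illustrative
figures, not the ones from this job):

-- ~50 GB / 200 mappers = ~256 MB per split, expressed in bytes
SET mapreduce.input.fileinputformat.split.maxsize=268435456;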

As for creating the HAR, the external table on the archive read the files
correctly, but then the next transformation failed with a "cannot find dir"
error:

cannot find dir = har://hdfs-cluster:8020/user/usnm/archived/onek.har/file.xml.gz
in pathToPartitionInfo: [har:/user/usnm/archived/onek.har]

but it doesn't really matter - the split size gave me the functionality I
wanted! Thanks again!

All best,

From:  Nishant Kelkar <>
Date:  Wednesday, 13 August 2014 17:49
To:  Ana Gillan <>
Subject:  Re: Optimising Map and Reduce numbers for Compressed Dataset

This could be a problem since you're using CombineInputFormat and not
FileInputFormat. But the former extends the latter, so I'd guess it should be
fine? Another thing you could try is to group all the small files into one
large Hadoop archive (HAR). You could then try to build a table on top of this
single HAR file. More on that over here:

CREATE EXTERNAL TABLE ... ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'your_input_format_type' ... LOCATION 'har://user/path/to/data.har';

where your_input_format_type is mostly
"org.apache.hadoop.mapreduce.lib.input.TextInputFormat" if you're using
plain text.
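
As a rough sketch of those two steps (the archive name, paths, table name, and
single-column schema below are placeholders, and STORED AS TEXTFILE is just the
usual choice when the underlying data is plain text):

# pack the small files under /user/path/to/small_files into one archive
hadoop archive -archiveName data.har -p /user/path/to/small_files /user/path/to

-- then point an external table at the archive
CREATE EXTERNAL TABLE har_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'har:///user/path/to/data.har';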

I'm thinking this should work?

Let me know! 

Thanks and Regards,

On Wed, Aug 13, 2014 at 9:26 AM, Nishant Kelkar <> wrote:
> Maybe try this at the Hive terminal:
> SET mapreduce.input.fileinputformat.split.maxsize=your_split_size;
> Where "your_split_size" = SUM(all small file sizes) / #mappers you'd like
> Thanks and Regards,
> Nishant
> On Wed, Aug 13, 2014 at 4:00 AM, Ana Gillan <> wrote:
>> Hi,
>> I am currently experimenting with using Hive for a large dataset and I am
>> having trouble with optimising the jobs.
>> The dataset I have is quite a large number of fairly small gzipped files, but
>> I am carrying out a series of transformations on it, which means that the
>> size of the data being processed by the mappers and reducers is significantly
>> larger than the input data. As such, a very small number of mappers and
>> reducers are launched, but it takes a very long time to finish any job.
>> Hive is using CombineFileInputFormat and I have also set
>> hive.hadoop.supports.splittable.combineinputformat to true because the files
>> are compressed. What other settings should I be looking at? Is there a way to
>> specify the number of map tasks?
>> Thanks,
>> Ana
