hive-user mailing list archives

From Jörn Franke <>
Subject Re: Handling LZO files
Date Thu, 03 Dec 2015 14:28:20 GMT
Your Hive version is too old. You may also want to use another execution engine. I think your
problem might be related to external tables, to which the parameters you set probably
do not apply. I once had the same problem, and I needed to change the block size at the Hadoop
level (hdfs-site.xml) or at the Hive level (hive-site.xml). It was definitely not possible
from within a Hive session (set ...). I would need to check the documentation.
In any case, loading the data into ORC or Parquet makes a lot of sense, but only with a recent
Hive version and Tez or Spark as the execution engine.

> On 03 Dec 2015, at 14:58, Harsha HN <> wrote:
> Hi Franke,
> It's a 100+ node cluster. Roughly 2 TB of memory and 1000+ vCores were available when I ran my job, so infrastructure is not a problem here.
> Hive version is 0.13
> As for ORC or Parquet, that would require us to load 5 years of LZO data into ORC or Parquet format. Though it might be more efficient in terms of performance, it increases data redundancy.
> But we will explore that option. 
> Currently I want to understand why I am unable to scale up mappers.
> Thanks,
> Harsha
>> On Thu, Dec 3, 2015 at 7:02 PM, Jörn Franke <> wrote:
>> How many nodes, cores and memory do you have?
>> What hive version?
>> Do you have the opportunity to use tez as an execution engine?
>> Usually I use external tables only for reading the data and inserting it into a table in ORC or Parquet format for analytics.
>> This is much more performant than JSON or any other text-based format.
>>> On 03 Dec 2015, at 14:20, Harsha HN <> wrote:
>>> Hi,
>>> We have LZO-compressed JSON files in our HDFS locations. I am creating an "External" table on the data in HDFS for the purpose of analytics.
>>> There are 3 LZO-compressed part files of size 229.16 MB, 705.79 MB, and 157.61 MB respectively, along with their index files.
>>> When I run a count(*) query on the table, I observe only 10 mappers, which hurts performance.
>>> I even tried the following (going for a 30 MB split):
>>> 1) set mapreduce.input.fileinputformat.split.maxsize=31457280;
>>> 2) set dfs.blocksize=31457280;
>>> But I am still getting 10 mappers.
>>> Can you please guide me in fixing the same?
>>> Thanks,
>>> Sree Harsha
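
A rough way to sanity-check the mapper count reported above is to estimate how many input splits each file would produce for a given split size. The sketch below is only an approximation under stated assumptions: it assumes a 128 MB split size when the session settings are ignored (the Hadoop 2.x default for dfs.blocksize, which is not stated anywhere in this thread), it mimics the 1.1 slop factor in Hadoop's FileInputFormat split loop, and it ignores LZO index boundaries and any split combining done by Hive. The names estimate_splits and total_splits are hypothetical helpers, not part of Hive or Hadoop.

```python
# File sizes from the original message, in MB.
FILES_MB = [229.16, 705.79, 157.61]

MB = 1024 * 1024
SPLIT_SLOP = 1.1  # slop factor used by Hadoop's FileInputFormat split loop


def estimate_splits(file_size_bytes, split_size_bytes):
    """Approximate the number of splits for one splittable file."""
    remaining = file_size_bytes
    count = 0
    # Cut full-size splits while noticeably more than one split remains.
    while remaining / split_size_bytes > SPLIT_SLOP:
        remaining -= split_size_bytes
        count += 1
    # Whatever is left becomes the final (possibly short) split.
    if remaining > 0:
        count += 1
    return count


def total_splits(files_mb, split_size_mb):
    """Sum the estimated splits over all files for one split size."""
    return sum(estimate_splits(f * MB, split_size_mb * MB) for f in files_mb)


if __name__ == "__main__":
    # Assumed 128 MB split size (settings not applied) vs. the requested 30 MB.
    for split_mb in (128, 30):
        print(split_mb, "MB split ->", total_splits(FILES_MB, split_mb), "splits")
```

Under these assumptions the estimate is 10 splits at 128 MB and 38 splits at 30 MB. The first figure is consistent with the 10 mappers observed, which would fit the suggestion that the session-level settings were not being applied, though it does not prove it.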
