hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Liu, Ming (Ming)" <>
Subject re: Optimizing external table structure
Date Sun, 14 Feb 2016 00:05:55 GMT
Hi, Zack,

Can you tell how many reducer were used? I feel that is also a key factor.
I do not know Phoenix's csvbulkupload utility, is it better than HBase's native importtsv

From what I used 'importtsv', the reducer number is very important, the utility first check
the target hbase table about how many regions there. And start same number of reducer, if
the target hbase table is not pre-created or pre-split, it will start only one reducer, which
is very slow. A good reducer number is the total number of disks. 16*13 in your cases , so
maybe you should presplit your hbase table with that many regions.
And you should also use compression in your target hbase table, which will reduce the total
writing bytes a lot and much better performance.

And as I understand, with 13 nodes, you spent 3 hours to load 2T hive data (6,000 300~400m
src files) into hbase, is this correct? Or 3 hours just finish the mapper's job, not including
reducer? If you just ask how to speed up the mapper speed, sorry that my answer make no sense,
and I don't have any better idea.

As for hdfs file size, I agree with Jorn franke that larger file is better than a lot of small
files. The mapper can still run as many as possible by reading different data blocks, but
the name node will be very happy. But I feel it has very little improvement from bulkloading
point of view. 4G, 40G or 400M should not have visible difference in loading speed. Time spend
with NameNode in the bulk loading process is very tiny part. Not sure, better you can test

As a reference, I have a 8 data nodes cluster, and 5 SSD disks each node. With importtsv,
the best loading speed I can get up to now is about 350G/hour, so it will take me about 6
hours to finish your job. 350G is the raw size of the original csv file, not total size in
HDFS which multiplied by 3 (the replica). If I am right, your loading speed is already double
of mine, so I cannot give you any advices and want to ask for your help :-)

I am struggled to achieve a better bulk loading speed as well. As a developer of Apache Trafodion,
I am also using its bulkloader, and its bulkloader will have about 3x speed up of loading
same data from hive to similar hbase target table than the importtsv. But the target format
is encoded by Trafodion, one can only access those data via Trafodion's SQL interface, JDBC/ODBC
etc. So maybe not an option for you. 

I will try Phoenix's bulkloader asap. I also heard a Spark bulkloader utility, but not able
to try it yet. Hope to keep touch with you.


发件人: Riesland, Zack [] 
发送时间: 2016年2月13日 23:07
主题: RE: Optimizing external table structure


We have 16 disks per node, to answer your question.
From: Jörn Franke []
Sent: Saturday, February 13, 2016 9:46 AM
Subject: Re: Optimizing external table structure

How many disk drives do you have / node?
Generally one node should have 12 drives not configured as raid and not configured as lvm.

Files could be a little bit larger (4 or better 40 gb - your namenode will thank you) or use
Hadoop Archive (HAR).

I am not sure about the latest status of Phoenix but maybe you can can make hbase tables directly
available as external tables in Hive - you would save a lot of time by not converting to csv.
You could also explore using sqoop (import from hive to jdbc / Phoenix or to hbase).

On 13 Feb 2016, at 13:41, Riesland, Zack <<>>

On a daily basis, we move large amounts of data from hive to hbase, via phoenix.

In order to do this, we create an external hive table with the data we need to move (all a
subset of 1 compressed ORC table), and then use the Phoenix CsvBulkUpload utility. From everything
I've read, this is the best approach.

My question is: how can I optimize my external table to make the bulk upload as efficient
as possible?

For example, today, my external table is backed by 6,020 files in HDFS, each about 300-400mb.

This results in a mapreduce operation with 12,209 mappers that takes about 3 hours (we don't
have a huge cluster – 13 data nodes currently).

Would it be better to have more, smaller files? Fewer, larger files?
View raw message