hive-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Network throughput from HiveServer2 to JDBC client too low
Date Mon, 20 Jun 2016 15:28:36 GMT
Hello,

For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this, neither for import nor for export of large data volumes, and it is a highly questionable approach from a reliability point of view.

Export it as a file to HDFS and fetch it from there, or use Oozie to move the file from HDFS to an SFTP or other server. There are alternatives depending on your use case.
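As a minimal sketch of the first option (assuming Hive 0.11+ syntax, a writable staging directory `/tmp/export/day1`, and the connection details from the query below): let the cluster write the result set to HDFS as a MapReduce job, then pull the files with the HDFS client, so the rows never pass through HiveServer2's JDBC result stream.

```shell
# Write the result directly to an HDFS directory as delimited text.
# The write runs as a parallel MapReduce job, not through HiveServer2.
beeline -u jdbc:hive2://srv:10000/db -n user -p password -e "
  INSERT OVERWRITE DIRECTORY '/tmp/export/day1'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  SELECT <a lot of columns> FROM \`db\`.\`table\`
  WHERE year=2016 AND month=6 AND day=1 AND hour=10"

# Fetch the result files via bulk HDFS block transfer and
# concatenate them into one local file.
hdfs dfs -getmerge /tmp/export/day1 day1.csv
```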

Best regards

> On 20 Jun 2016, at 16:43, David Nies <david.nies@adition.com> wrote:
> 
> Dear Hive mailing list,
> 
> in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck
> and I’m seeking a way to increase throughput. Let me elaborate my use case:
> 
> I’m using Hive version 1.1.0, which is bundled with Cloudera 5.5.1.
> 
> I want to fetch a huge amount of data from our Hive cluster. By huge I mean something
> around 100 million rows. The Hive table I’m querying is an external table whose data is
> stored in .avro. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows)
> is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is not an
> option, sadly, since I need all the data.
> 
> I was able to reduce the time the MapReduce job takes to an agreeable interval by fiddling
> around with `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking ages
> comes after MapReduce. I’m observing that the Hadoop namenode hosting the HiveServer2
> is sending data at merely around 3 MB/s. Our network is capable of much more. Playing
> around with `fetchSize` did not increase throughput.
> 
> As I identified network throughput to be the bottleneck, I restricted my efforts to trying
> to increase it. For this, I simply ran the query I’d normally run through JDBC (from Clojure/Java)
> via `beeline` and dumped the output to `/dev/null`. My `beeline` query looks something like
> this:
> 
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6
>     AND day=1 AND hour=10)' > /dev/null
> 
> I already tried playing around with additional `--hiveconf`s:
> 
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
> 
> without success. 
> 
> In all cases, Hive is able to utilize only a tiny fraction of the available bandwidth.
> Is there a possibility to increase network throughput?
> 
> Thank you in advance!
> 
> Yours
> 
> David Nies
> Developer, Business Intelligence
>  
> ADITION technologies AG
>  
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>  
> T +49 211 987400 30
> F +49 211 987400 33
> E david.nies@adition.com
>  
> Technical support is available at +49 1805 2348466
> (landline: 14 ct/min; mobile: max. 42 ct/min)
>  
> Follow us on XING or visit us at www.adition.com.
>  
> Executive board: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter
> Chairman of the supervisory board: Joachim Schneidmadl
> Registered with the district court of Düsseldorf under HRB 54076
> VAT ID: DE 218 858 434
> 
