hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Network throughput from HiveServer2 to JDBC client too low
Date Mon, 20 Jun 2016 15:41:12 GMT
Hi David,

What are you actually trying to do with the data.

Hive and map-reduce are notoriously slow for this type of operations. Hive
is good for storage that is what I vouch for.

There are other alternatives.

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 20 June 2016 at 15:43, David Nies <david.nies@adition.com> wrote:

> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck and I’m seeking a way do increase throughput. Let me
> elaborate my use case:
>
> I’m using Hive version 1.1.0 that is bundeled with Clouders 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster. By huge I
> mean something around 100 million rows. The Hive table I’m querying is an
> external table whose data is stored in .avro. On HDFS, the data I want to
> fetch (i.e. the aforementioned 100 million rows) is about 5GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is no option,
> sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an agreeable
> interval fiddling around with
> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking
> ages comes after MapReduce. I’m observing that the Hadoop namenode that is
> hosting the HiveServer2 is merely sending data with around 3 MB/sec. Our
> network is capable of much more. Playing around with `fetchSize` did not
> increase throughput.
>
> As I identified network throughput to be the bottleneck, I restricted my
> efforts to trying to increase it. For this, I simply run the query I’d
> normally run through JDBC (from Clojure/Java) via `beeline` and dumping the
> output to `/dev/null`. My `beeline` query looks something like that:
>
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
> I already tried playing around with additional `—hiveconf`s:
>
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is able only to utilize a tiny fraction of the
> bandwidth that is available. Is there a possibility to increase network
> throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
> Entwickler Business Intelligence
>  ADITION technologies AG
>
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>
> T +49 211 987400 30
> F +49 211 987400 33
> E david.nies@adition.com
>
> Technischen Support erhalten Sie unter der +49 1805 2348466
> (Festnetzpreis: 14 ct/min; Mobilfunkpreise: maximal 42 ct/min)
>
> Abonnieren Sie uns auf XING oder besuchen Sie uns unter www.adition.com.
>
> Vorstände: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter
> Aufsichtsratsvorsitzender: Joachim Schneidmadl
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> UStIDNr.: DE 218 858 434
>
>
>

Mime
View raw message