hive-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Network throughput from HiveServer2 to JDBC client too low
Date Tue, 21 Jun 2016 07:26:36 GMT
This is a classic issue. Are there other users connecting to Hive over the
same network?

Can your Unix admin use a network sniffer to find out what is happening in
your case?

In normal operations with a modest amount of data, do you see the same issue,
or is this purely due to your load (the number of rows returned) of 100M
rows?
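
To get a rough feel for the scale, here is a back-of-the-envelope sketch in
Python. The 100M rows, the 3 MB/sec and the fetch sizes of 50 and 10000 all
come from this thread; the 200 bytes per row is purely an assumption for
illustration, so substitute your real average row size.

```python
TOTAL_ROWS = 100_000_000          # row count mentioned in this thread
OBSERVED_MB_PER_SEC = 3.0         # throughput reported in this thread
ASSUMED_BYTES_PER_ROW = 200       # hypothetical, NOT from the thread


def fetch_round_trips(total_rows: int, fetch_size: int) -> int:
    """Number of fetch calls needed to pull total_rows, fetch_size rows at a time."""
    # Ceiling division using integers only, so no float rounding is involved.
    return -(-total_rows // fetch_size)


def transfer_hours(total_rows: int, bytes_per_row: int, mb_per_sec: float) -> float:
    """Hours needed to move the result set at a steady rate (1 MB = 10**6 bytes)."""
    total_bytes = total_rows * bytes_per_row
    seconds = total_bytes / (mb_per_sec * 1_000_000)
    return seconds / 3600


if __name__ == "__main__":
    for fetch_size in (50, 10_000):
        trips = fetch_round_trips(TOTAL_ROWS, fetch_size)
        print(f"fetchSize={fetch_size}: {trips} fetch calls")
    hours = transfer_hours(TOTAL_ROWS, ASSUMED_BYTES_PER_ROW, OBSERVED_MB_PER_SEC)
    print(f"estimated transfer time: {hours:.2f} hours")
```

With a fetch size of 50 that is two million fetch calls, against ten thousand
with a fetch size of 10000. The transfer time scales linearly with whatever
row size you plug in.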

Yes, I noticed that you are on Hive 1.1 from a vendor's package.

At this stage the question is what other alternatives there are for fetching
those 100 million rows.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 June 2016 at 08:15, David Nies <david.nies@adition.com> wrote:

>
>
> On 20.06.2016 at 20:20, Gopal Vijayaraghavan <gopalv@apache.org> wrote:
>
>
> is hosting the HiveServer2 is merely sending data with around 3 MB/sec.
> Our network is capable of much more. Playing around with `fetchSize` did
> not increase throughput.
>
> ...
>
> --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
>
> The current implementation you have is CPU bound in HiveServer2, the
> compression generally makes it worse.
>
> The fetch size does help, but it only prevents the system from doing
> synchronized operations frequently (pausing every 50 rows is too often,
> the default is now 10000 rows).
>
>   -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
>
> Quick q - are year/month/day/hour partition columns? If so, there might be
> a very different fix to this problem.
>
>
> Yes, year, month, day and hour are partition columns. I.e. I want to
> export exactly one partition. In my real use case, I want to use another
> filter (WHERE some_other_column = <x>), but for this case right here, it is
> exactly the data of one partition I want.
>
>
> In all cases, Hive is able only to utilize a tiny fraction of the
> bandwidth that is available. Is there a possibility to increase network
> throughput?
>
>
> A series of work-items are in progress for fixing the large row-set
> performance in HiveServer2
>
> https://issues.apache.org/jira/browse/HIVE-11527
>
> https://issues.apache.org/jira/browse/HIVE-12427
>
> It would be great if you could attach a profiler to your HiveServer2 and
> see which functions are hot. That will help fix those codepaths as part of
> the joint effort with the ODBC driver teams.
>
>
> I’ll see what I can do. I can’t restart the server at will though, since
> other teams are using it as well.
>
>
> Cheers,
> Gopal
>
>
> Thank you :)
> -David
>
>
