hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <>
Subject Re: Format dillema
Date Fri, 23 Jun 2017 21:27:14 GMT

> I guess I see different things. Having used all the tech. In particular for large hive
queries I see OOM simply SCANNING THE INPUT of a data directory, after 20 seconds! 

If you've got an LLAP deployment you're not happy with - this list is the right place to air
your grievances. I usually try help anybody who haven't quite figured their way around it
& I'm always happy to debug performance issues across the board.

But, if your definition of "all the tech" is text files and gzip on MR, then Datanami has
an article out today.

And my favourite quote is the side-bar image - MapReduce SQL query performance is “a dumpster

> "Magically" jk. Impala allow me to query those TEXT files in milliseconds, so logical
deduction says the format of the data ORC/TEXT can't be the most important factor here.

You're right - the most important factor is the execution engine for Hive. That drives the
logical plans and optimizations specific to those engines, which are definitely more significant.

Picking the best Hive engine is the most important factor, if you're doing ETL or BI  - Comcast
got Parquet on LLAP to work fast as well.

Stinger wasn't just about ORC or Vectorization or ACID or Tez, but to be greater than sum
of the parts for an EDW.

Your milliseconds comment reminded me of an off-hand comment made by the Yahoo Japan folks
about the time they tried Impala out for their 700 node use-case.

Impala did very well at low workloads. But low load is a very artificial condition in a well
utilized cluster. Production clusters usually run at near-full utilization or they're leaving
money on the table.

This is also a scale problem, running at 50% utilization at 2 nodes is very different from
running at 50% on 50, in real dollars.

>From their utilization tests, Yahoo Japan detailed their experience running LLAP with
300+ concurrent queries last year (with *NO* query failures, instead of "fast or die").

> 2 impala server (each node)

The problem with making up your mind based on a 2 node cluster, is that scale problems are
somewhat like the tides - huge and massive, ignore it at your risk, but impossible to measure
in a swimming pool.

If you want, I can go into details about what is different about 3-replica HDFS data on 700
nodes vs 2 nodes - also imagine that your last hour is "hot" in the BI dashboard (and that
means 4 HDFS blocks are 85% of IO reqs, which in an MPP database model runs out of capacity
at 0.5% disk IO utilization) .

If you're happy with Impala at the single digit node scale, that's useful to know - but it
does not extrapolate naturally.


View raw message