impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mostafa Mokhtar <mmokh...@cloudera.com>
Subject Re: Debugging slow Impala hdfs-scans
Date Wed, 06 Dec 2017 22:51:55 GMT
The query profile shows that a lot of time is spent in "-
TotalStorageWaitTime: 20h4m", usually this is an indicator that Impala is
waiting on IO from HDFS.

Number of files can also be an issue, I recommend checking GC time for the
HDFS NameNode.

The filter is not selective so it is interesting to see that Presto is
running the same query faster, did you observe IO rates while Presto was
running or the data was read from the file system cache?



On Wed, Dec 6, 2017 at 2:01 PM, Piyush Narang <p.narang@criteo.com> wrote:

> Hi folks,
>
>
>
> I’m trying to debug why the performance of our new Impala setup is
> performing a bit worse on the same queries as Presto. We’re running Impala
> 2.10.0 and starting it up with 225G of memory (-mem-limit). It runs on the
> same set of nodes as Presto (4 nodes, 48 cores each, 10G ethernet). We
> noticed a couple of queries in Impala were fairly slow in the hdfs-scan
> stage so we tried to isolate the behavior with a slightly simpler query:
>
> select max(hour + nb_display) from my_large_parquet_table where day =
> '2017-10-04';
>
>
>
> This query takes around 28-30 mins on Impala and seems to run in around
> 8.5 mins on Presto. The input table is 11.64TB and uses Parquet (snappy
> compressed).
>
>
>
> Looking at the performance profile (attached in case anyone’s interested),
> I noticed that during the query Impala seems to be averaging around 1.1 -
> 1.2 MB/s per thread and the total read throughput is around 2.13 MB/s. I
> tried bumping up the number of scanner threads (from 48 to 96) but that
> seemed to only help marginally (improved runtimes by ~10s). Running “sar –n
> DEV 1” on some of our hosts while the query is running, seems to show that
> Impala is reading at a rate of 30-50 MB/s (whereas we see this go up to
> 60-125 MB/s on our other runs).
>
>
>
> We haven’t tweaked our Impala setup much beyond the defaults that come out
> of the box. I’m wondering if I’m missing some tuning settings that help
> improve read rates from HDFS when Impala is running outside of the
> datanodes. If anyone has any ideas / suggestions, they’d be welcome. I am
> happy to provide more details if needed as well.
>
>
>
> Thanks,
>
>
>
> -- Piyush
>
>
>

Mime
View raw message