impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maurin Lenglart <mau...@cuberonlabs.com>
Subject Re: low CPU usage
Date Fri, 10 Mar 2017 23:38:37 GMT
Hi,
You can look at the MT_DOP flag (only avaible in the latest impala release 2.8 with cdh 5.10)
It enable multi-threading for queries that are not using joins. 

On 3/10/17, 9:52 AM, "Joseph Naegele" <jnaegele@grierforensics.com> wrote:

    Hi folks,
    
    I'm running a very rough comparison of Spark and Impala for a specific operation in our
application. I have a 2+ billion record table on which I run a window function to remove duplicate
records. AFAIK this is similar to a self-join. The data doesn't fit entirely in memory but
Spark is able to spill to disk and complete the operation in 40 minutes on this one machine
while the equivalent query in Impala takes 2.3 hours. The machine has approx. 250 GB RAM and
56 vcores. While running in Impala I observe very low (1-core) CPU usage throughout.
    
    My dataset is around 52 GB compressed (2-3x that uncompressed) in Parquet+Gzip format
spread across 1200 files. My query looks like this:
    
    CREATE TABLE compacted STORED AS PARQUET LOCATION "/tmp/compacted" AS
    SELECT a, b, c, d, e FROM (
        SELECT row_number() OVER (
            PARTITION BY c, d ORDER BY e DESC
        ) rank, * FROM duplicates
    ) q WHERE q.rank=1;
    
    Where fields a, b, c, ... are strings or long integers. Also, I've run COMPUTE STATS on
the table.
    
    I'm using the Cloudera quickstart Docker image. Impalad reports version as impalad version
2.5.0-cdh5.7.0 RELEASE. I've allocated 6 local disks to Impalad's scratch space (since I expect
spillage). Otherwise the configuration is vanilla. I'm monitoring CPU usage using htop and
CPU utilization fluctuates between 1/56 and 5/56 of available CPU. I have the Impala query
profile available in the web UI, but I wanted to ask if there's anything obvious I'm missing
before posting too much detail.
    
    Is there anything obvious I should look into or configure to achieve higher CPU utilization
and faster overall performance?
    
    Thanks
    ---
    Joe Naegele
    Grier Forensics
    
    
    

Mime
View raw message