impala-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IMPALA-5437) Codegen for Trunc() of timestamp takes far too long
Date Mon, 05 Jun 2017 18:08:04 GMT
Tim Armstrong created IMPALA-5437:
-------------------------------------

             Summary: Codegen for Trunc() of timestamp takes far too long
                 Key: IMPALA-5437
                 URL: https://issues.apache.org/jira/browse/IMPALA-5437
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
    Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0
            Reporter: Tim Armstrong
            Assignee: Tim Armstrong


The following query takes longer to run than it should because of codegen time.

{code}
use functional_parquet;
   SELECT
   t_9.c_5,
   MIN(t_9.c_6),
   MIN(t_9.c_6),
   MIN(t_9.c_6)
FROM
(
      SELECT
         at1.string_col c_5,
         SUM(at1.int_col) c_6
      FROM
         alltypes at1
      JOIN
      (
         SELECT
            t_4.c_1 cot1,
            MIN(t_4.c_2) imp2
         FROM
         (
               SELECT
                  at2.string_col c_1,
                  SUM(at2.bigint_col) c_2
               FROM
                  alltypesagg at2
               GROUP BY
                  1
         ) t_4
         GROUP BY
            1
         ORDER BY
            2 DESC NULLS LAST
         LIMIT 100
      ) t_8
      ON
         t_8.cot1 = at1.string_col
      WHERE
         CASE WHEN    (DATEDIFF(TO_DATE(TRUNC(at1.timestamp_col, 'MM')), TO_DATE(TRUNC(TRUNC(at1.timestamp_col,
'MM'), 'YY'))) + 1) >= 152
AND
   YEAR(at1.timestamp_col) >= 2017 THEN 0 ELSE 1 END = 1
      AND
         at1.timestamp_col >= CAST('2014-01-01 00:00:00' AS TIMESTAMP)
      AND
         CASE WHEN TRUNC(at1.timestamp_col, 'MM') = CAST(CAST('2014-01-01' AS TIMESTAMP) AS
TIMESTAMP) THEN at1.timestamp_col ELSE CAST(NULL AS TIMESTAMP) END >= CAST('2014-01-01
00:00:00' AS TIMESTAMP)
      GROUP BY
         1
) t_9
GROUP BY 1;
{code}

The size of the codegen module is excessive compared to the complexity of the query:
{code}
        CodeGen:(Total: 1s402ms, non-child: 1s402ms, % non-child: 100.00%)
           - CodegenTime: 6.117ms
           - CompileTime: 465.399ms
           - LoadTime: 0.000ns
           - ModuleBitcodeSize: 1.98 MB (2077396)
           - NumFunctions: 194 (194)
           - NumInstructions: 4.01K (4011)
           - OptimizationTime: 890.290ms
           - PeakMemoryUsage: 1.96 MB (2053632)
           - PrepareTime: 21.772ms
{code}

I dug into what was happening and the module is full of boost timestamp functions. It looks
like most of this is coming from our implementation of Trunc(), which switches between a large
number of different implementations.

I moved Trunc() out of the cross-compiled IR and codegen time was more sensible:
{code}
        CodeGen:(Total: 357.087ms, non-child: 357.087ms, % non-child: 100.00%)
           - CodegenTime: 2.748ms
           - CompileTime: 83.765ms
           - LoadTime: 0.000ns
           - ModuleBitcodeSize: 1.97 MB (2066128)
           - NumFunctions: 149 (149)
           - NumInstructions: 2.22K (2218)
           - OptimizationTime: 245.623ms
           - PeakMemoryUsage: 1.08 MB (1135616)
           - PrepareTime: 20.145ms
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message