impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Armstrong <tarmstr...@cloudera.com>
Subject Re: Bug with complex types
Date Wed, 28 Dec 2016 15:31:55 GMT
I notice that " e.date_id = 2420" doesn't appear anywhere in the explain
plan - was the explain plan just for a slightly different query?

It does look like IMPALA-4049 but the Impala server that ships with CDH
5.8.3 should have the fix:
https://github.com/cloudera/Impala/commits/cdh5-2.6.0_5.8.3. It sounds like
that's what you're running.

It could be a parquet bug like
https://issues.cloudera.org/browse/IMPALA-4539 or
https://issues.cloudera.org/browse/IMPALA-4444.

I think we really need to just try and reproduce it - would you be able to
file an issue on https://issues.cloudera.org?


On Tue, Dec 27, 2016 at 10:35 PM, Dejan Prokić <dejanp@nordeus.com> wrote:

> Hi Tim,
>
> Thanks for the quick response!
>
> I am running version 2.6.0, more precisely:
> Server version: impalad version 2.6.0-cdh5.8.3 RELEASE (build
> c644f476b774db9db87a619628f7a6ecc5f843e0)
>
> Here is explain plan:
> +-----------------------------------------------------------
> -------------------------+
> | Explain String
> |
> +-----------------------------------------------------------
> -------------------------+
> | Estimated Per-Host Requirements: Memory=882.00MB
> VCores=2                          |
> | WARNING: The following tables are missing relevant table and/or column
> statistics. |
> | dejanp.event
> |
> |
> |
> | 09:EXCHANGE [UNPARTITIONED]
> |
> | |
> |
> | 08:AGGREGATE [FINALIZE]
>                                         |
> | |  output: count:merge(*)
> |
> | |  group by: won
>                                         |
> | |
> |
> | 07:EXCHANGE [HASH(won)]
> |
> | |
> |
> | 06:AGGREGATE [STREAMING]
>                                         |
> | |  output: count(*)
> |
> | |  group by: max(CASE WHEN key = 'won' THEN value
> END)                             |
> | |
> |
> | 01:SUBPLAN
> |
> | |
> |
> | |--05:NESTED LOOP JOIN [CROSS JOIN]
> |
> | |  |
> |
> | |  |--02:SINGULAR ROW SRC
> |
> | |  |
> |
> | |  04:AGGREGATE [FINALIZE]
>                                      |
> | |  |  output: max(CASE WHEN key = 'won' THEN value
> END)                            |
> | |  |
> |
> | |  03:UNNEST [e.event_map]
>                                         |
> | |
> |
> | 00:SCAN HDFS [dejanp.event e]
> |
> |    partitions=1/1 files=64 size=18.76GB
> |
> |    predicates: e.event_id = 10012
> |
> +-----------------------------------------------------------
> -------------------------+
>
> Looks very similar to https://issues.cloudera.org/browse/IMPALA-4049
> Looking at the list of bug fixes, I should upgrade Impala to the latest
> version.
>
> Dejan
>
>
> *Dejan Prokić* | Data Engineer | Nordeus
>
> 2016-12-27 17:18 GMT+01:00 Tim Armstrong <tarmstrong@cloudera.com>:
>
>> Hi Dejan,
>>   Thanks for the bug report!
>>
>> There are a few issues related to nested types that you could be hitting.
>> It depends on the versions of Impala - we fixed different cases in
>> different versions. What version are you running? Also if you have the
>> "explain" plan available that would help in narrowing down the possible
>> cases.
>>
>> I did a quick search that I think pulls up all the relevant JIRAs:
>> https://issues.cloudera.org/issues/?jql=project%20%3D%20IMPA
>> LA%20AND%20labels%20%3D%20correctness%20AND%20text%20~%
>> 20nested%20ORDER%20BY%20updated%20DESC
>>
>> Could be https://issues.cloudera.org/browse/IMPALA-4049 or
>> https://issues.cloudera.org/browse/IMPALA-3311
>>
>> - Tim
>>
>> On Tue, Dec 27, 2016 at 1:58 AM, Dejan Prokić <dejanp@nordeus.com> wrote:
>>
>>> Hi,
>>>
>>> I noticed a bug when selecting data from a table with complex type. This
>>> is definition of my table:
>>>
>>> CREATE TABLE event (
>>>   user_id BIGINT,
>>>   event_id INT,
>>>   event_type_id INT,
>>>   ts_bigint BIGINT,
>>>   processed_ts_bigint BIGINT,
>>>   event_map MAP<STRING, STRING>,
>>>   server_id INT
>>> )
>>> PARTITIONED BY (date_id INT)
>>> STORED AS PARQUET;
>>>
>>> I want to run this query:
>>>
>>> WITH
>>> battles AS (
>>> SELECT
>>>   *
>>> FROM event e,
>>>   (SELECT
>>>     MAX(CASE WHEN key='won' THEN value END) AS won
>>>   FROM e.event_map) m
>>> WHERE e.date_id = 2420 AND e.event_id = 10012
>>> )
>>> SELECT won, count(*) FROM battles
>>> GROUP BY won;
>>>
>>> Map element with key = 'won' has only 'true' and 'false' values. If
>>> battles subquery has more than 5 million rows the query returns strange
>>> results, and when it has less rows query returns correct data.
>>>
>>> Are you aware of this issue? I didn't send you my data since it is
>>> pretty large for mail (20 GB).
>>>
>>> Thanks
>>>
>>> *Dejan Prokić* | Data Engineer | Nordeus
>>>
>>
>>
>

Mime
View raw message