drill-issues mailing list archives

From "Rahul Challapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5472) Parquet reader generating low-density batches causing Sort operator to spill un-necessarily
Date Thu, 04 May 2017 17:01:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997075#comment-15997075 ]

Rahul Challapalli commented on DRILL-5472:

The parquet file is generated using either constants or missing fields:
create table drill5472 as
    select
      d.map map,
      d.map.missing1 missing1, 
      'hello' as missing2, 
      true as missing3, 
      5.888 as missing4, 
      cast('abcd' as varchar) missing5, 
      cast('1998-01-01' as date) missing6, 
      cast(1.1 as decimal(28,2)) missing7, 
      CAST(456 as CHAR(3)) missing8, 
      cast('P1Y' as interval year) missing9, 
      cast('P1D' as interval day) missing10,
      cast('P1Y1M1DT1H1M' as interval second) missing11,
      CONVERT_FROM('{x:100, y:215.6}' ,'JSON') missing12,
      STRING_BINARY(CONVERT_TO(1, 'INT')) missing13,
      STRING_BINARY(CONVERT_TO(1, 'INT_BE')) as missing14,
      STRING_BINARY(CONVERT_TO(1, 'BIGINT')) as missing15,
      STRING_BINARY(CONVERT_TO(1, 'BIGINT')) as missing16,
      STRING_BINARY(CONVERT_TO(1, 'INT_HADOOPV')) as missing17,
      STRING_BINARY(CONVERT_TO('hello', 'UTF8')) as missing18,
      STRING_BINARY(CONVERT_TO('hello', 'UTF16')) missing19,
      CONVERT_FROM(BINARY_STRING('\x00\x00\x00\xC8'), 'INT_BE') AS missing20,
      CONVERT_FROM(BINARY_STRING('\x00\x00\x00\xC8'), 'INT') AS missing21,
      CONVERT_FROM(BINARY_STRING('\xBE\xBA\xFE\xCA'), 'INT_BE') AS missing22,
      CONVERT_TO(-1095041334, 'INT_BE') as missing23,
      TO_CHAR(1256.789383, '#,###.###') missing24,
      TO_CHAR((CAST('2008-2-23' AS DATE)), 'yyyy-MMM-dd') missing25,
      CAST('12:20:30' AS TIME) missing26,
      CAST('2015-2-23 12:00:00' AS TIMESTAMP) missing27,
      TO_DATE('2015-FEB-23', 'yyyy-MMM-dd') missing28,
      EXTRACT(year from mydate) `missing 29`,
      TO_DATE(1427849046000) missing30,
      TO_NUMBER('987,966', '######') missing31,
      TO_TIME('12:20:30', 'HH:mm:ss') missing32,
      TO_TIMESTAMP('2008-2-23 12:00:00', 'yyyy-MM-dd HH:mm:ss') missing33,
      TIMEOFDAY() missing34,
      d.map.missingmap.m1 m1 
    from dfs.`/drill/testdata/resource-manager/nested-large.json` d;

> Parquet reader generating low-density batches causing Sort operator to spill un-necessarily
> -------------------------------------------------------------------------------------------
>                 Key: DRILL-5472
>                 URL: https://issues.apache.org/jira/browse/DRILL-5472
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Storage - Parquet
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: drill5472.log, drill5472.parquet, drill5472.sys.drill
> git.commit.id.abbrev=1e0a14c
> The parquet file used in the below query is ~20MB. The uncompressed size is ~1.2 GB.
> Now the below query has a sort which is given ~6GB memory for a single fragment and yet it spills.
> {code}
> select * from (select * from dfs.`/drill/testdata/resource-manager/all_types_large` s order by s.missing12.x) d where d.missing3 is false;
> {code}
> The profile indicates that the above query has spilled twice. Attached are the profile and the logs.
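
The sizes quoted in the description can be turned into a rough bound on batch density. This is only a back-of-the-envelope sketch: it assumes the sort needs to buffer roughly the uncompressed data and that the whole ~6GB budget is usable for buffering, neither of which is confirmed here. The variable names are illustrative only.

```python
# Rough estimate from the figures in the report (all values approximate).
# Assumption: the sort buffers about the full uncompressed data set, and
# the entire memory budget is available for buffering batches.
uncompressed_gb = 1.2   # uncompressed size of the parquet data
sort_memory_gb = 6.0    # memory given to the sort for a single fragment

# How many times the uncompressed data would fit into the sort's budget.
headroom = sort_memory_gb / uncompressed_gb

# If the sort spills anyway, the batches must take up more than the whole
# budget, so at most this fraction of the batch memory can be real data.
max_density = uncompressed_gb / sort_memory_gb

print(f"headroom: {headroom:.1f}x")
print(f"implied upper bound on batch density: {max_density:.0%}")
```

Under these assumptions the data fits in the sort's memory several times over, so a spill points at batches that are mostly empty space rather than at a genuine shortage of memory.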

