drill-issues mailing list archives

From "Robert Hou (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (DRILL-6276) Drill CTAS creates parquet file having page greater than 200 MB.
Date Mon, 19 Mar 2018 20:58:00 GMT

     [ https://issues.apache.org/jira/browse/DRILL-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Hou reassigned DRILL-6276:
---------------------------------

    Assignee: Pritesh Maker

> Drill CTAS creates parquet file having page greater than 200 MB.
> ----------------------------------------------------------------
>
>                 Key: DRILL-6276
>                 URL: https://issues.apache.org/jira/browse/DRILL-6276
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.13.0
>            Reporter: Robert Hou
>            Assignee: Pritesh Maker
>            Priority: Major
>         Attachments: alltypes_asc_16MB.json
>
>
> I used this CTAS to create a parquet file from a json file:
> {noformat}
> create table `alltypes.parquet` as select
>   cast(BigIntValue as BigInt) BigIntValue, cast(BooleanValue as Boolean) BooleanValue,
>   cast(DateValue as Date) DateValue, cast(FloatValue as Float) FloatValue,
>   cast(DoubleValue as Double) DoubleValue, cast(IntegerValue as Integer) IntegerValue,
>   cast(TimeValue as Time) TimeValue, cast(TimestampValue as Timestamp) TimestampValue,
>   cast(IntervalYearValue as INTERVAL YEAR) IntervalYearValue,
>   cast(IntervalDayValue as INTERVAL DAY) IntervalDayValue,
>   cast(IntervalSecondValue as INTERVAL SECOND) IntervalSecondValue,
>   cast(BinaryValue as binary) Binaryvalue, cast(VarcharValue as varchar) VarcharValue
> from `alltypes.json`;
> {noformat}
> I ran parquet-tools/parquet-dump on the resulting file:
>     VarcharValue TV=6885 RL=0 DL=1
>     ------------------------------------------------------------------------------------------------
>     page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885
> The page size is about 16 MB (SZ:17240317) with a 16 MB data set.  When I try a similar
> 1 GB data set, the page size starts at over 200 MB and decreases to about 1 MB.
>     VarcharValue TV=208513 RL=0 DL=1
>     ------------------------------------------------------------------------------------------------
>     page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
>     page 1:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
>     page 2:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
>     page 3:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
>     page 4:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
>     page 5:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
>     page 6:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
>     page 7:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
>     page 8:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
>     page 9:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
>     page 10:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
>     page 11:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
>     page 12:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
>     page 13:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
>     page 14:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
>     page 15:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
> The column is a varchar, and the size of its values varies from 2 bytes to 5000 bytes.
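
The pattern in the dump above can be checked mechanically. The following is a minimal Python sketch, not part of the original report, that parses the quoted page lines and lists the pages above a chosen size limit along with the average bytes per value. The helper names (`parse_pages`, `oversized_pages`, `avg_value_size`) and the 2 MiB limit are assumptions made for illustration; the limit was picked only because the later pages settle near 1 MB.

```python
import re

# Page lines copied from the parquet-dump output quoted above (1 GB data set).
DUMP = """\
page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
page 1:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
page 2:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
page 3:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
page 4:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
page 5:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
page 6:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
page 7:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
page 8:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
page 9:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
page 10:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
page 11:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
page 12:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
page 13:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
page 14:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
page 15:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
"""

# Matches the page index, the SZ (page size in bytes) and VC (value count) fields.
PAGE_RE = re.compile(r"page\s+(\d+):.*?\bSZ:(\d+)\s+VC:(\d+)")


def parse_pages(text):
    """Return a list of (page_index, size_bytes, value_count) tuples."""
    pages = []
    for line in text.splitlines():
        match = PAGE_RE.search(line)
        if match:
            pages.append(tuple(int(group) for group in match.groups()))
    return pages


def oversized_pages(pages, limit_bytes):
    """Return the pages whose size exceeds limit_bytes."""
    return [page for page in pages if page[1] > limit_bytes]


def avg_value_size(page):
    """Average bytes per value in one page (size divided by value count)."""
    _, size_bytes, value_count = page
    return size_bytes / value_count


if __name__ == "__main__":
    LIMIT = 2 * 1024 * 1024  # hypothetical 2 MiB limit, chosen for illustration
    pages = parse_pages(DUMP)
    for index, size_bytes, value_count in oversized_pages(pages, LIMIT):
        print(f"page {index}: {size_bytes / 2**20:.1f} MiB, "
              f"{value_count} values, "
              f"{size_bytes / value_count:.0f} bytes/value on average")
```

Run against the lines above, the sketch flags pages 0 through 5. The value count roughly halves from one flagged page to the next, which is the same shrinking pattern visible in the SZ column.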



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
