drill-issues mailing list archives

From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-389) Nested Parquet data generated from Hive does not work
Date Mon, 09 Jun 2014 16:45:06 GMT

     [ https://issues.apache.org/jira/browse/DRILL-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacques Nadeau updated DRILL-389:
---------------------------------

    Priority: Critical  (was: Major)

> Nested Parquet data generated from Hive does not work
> -----------------------------------------------------
>
>                 Key: DRILL-389
>                 URL: https://issues.apache.org/jira/browse/DRILL-389
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.0.0-milestone-1
>         Environment: CentOS 6.3
> CDH 4.6 installed by Cloudera Manager Free Edition
> Hive 0.10.0
>            Reporter: Thaddeus Diamond
>            Assignee: Jason Altekruse
>            Priority: Critical
>         Attachments: avro_test.db, nobench.ddl, nobench_1.avsc, parquet-nobench_0.parquet
>
>
> In Hive, I generated Parquet data from Avro data as follows. Using the attached Avro file ({{avro_test.db}}) and the attached nested Avro schema ({{nobench_1.avsc}}), I created a Hive table:
> {noformat}
> CREATE TABLE avro_nobench_hdfs
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION 'hdfs:///user/hdfs/avro'
> TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hdfs/nobench.avsc');
> {noformat}
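> (For reproducibility: the DDL above assumes the attachments are already staged at the referenced HDFS paths; a minimal sketch, where the rename of {{nobench_1.avsc}} to {{nobench.avsc}} is an assumption:)
> {noformat}
> # sketch only; assumes the JIRA attachments are staged to match the DDL paths
> sudo -u hdfs hdfs dfs -mkdir /user/hdfs/avro
> sudo -u hdfs hdfs dfs -put avro_test.db /user/hdfs/avro/
> sudo -u hdfs hdfs dfs -put nobench_1.avsc /user/hdfs/nobench.avsc
> {noformat}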
> Note that this schema is based loosely on the NoBench benchmark proposed by Craig Chasseur for JSON (http://pages.cs.wisc.edu/~chasseur/).
> In order to create a Parquet Hive table you need to spell out the full column schema. The attached one is very large, so I generated the column list with:
> {noformat}
> sudo -u hdfs hive -e 'describe avro_nobench_hdfs' > /tmp/temp.sql
> {noformat}
> Then, I replaced each "from deserializer" with a comma (a scripted version of this substitution is sketched after the DDL) and added the following SQL DDL around it:
> {noformat}
> CREATE TABLE avro_nobench_parquet (
>     -- ... COLUMNS HERE
> )
> ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
> STORED AS
> INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
> OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
> {noformat}
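> A hypothetical one-liner for that substitution (the trailing comma on the last column still has to be removed by hand):
> {noformat}
> # rewrite "col<TAB>type<TAB>from deserializer" lines as "col type," column defs
> sed 's/[[:space:]]*from deserializer[[:space:]]*$/,/' /tmp/temp.sql
> {noformat}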
> Finally, I generated the actual Parquet binary data using {{INSERT OVERWRITE}}:
> {noformat}
> INSERT OVERWRITE TABLE avro_nobench_parquet SELECT * FROM avro_nobench_hdfs;
> {noformat}
> This completed successfully. I then validated the data with:
> {noformat}
> SELECT COUNT(*) FROM avro_nobench_parquet;
> SELECT * FROM avro_nobench_parquet LIMIT 1;
> {noformat}
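> (A quick way to see the output file is a directory listing:)
> {noformat}
> sudo -u hdfs hdfs dfs -ls /user/hive/warehouse/avro_nobench_parquet
> {noformat}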
> If you look in {{hdfs:///user/hive/warehouse/avro_nobench_parquet}} you'll see a single raw file (something like {{0000_0}}). Download it to the local filesystem:
> {noformat}
> sudo -u hdfs hdfs dfs -copyToLocal /user/hive/warehouse/avro_nobench_parquet/* .
> {noformat}
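> (For context: the milestone-1 builds ship a sqlline shell for issuing queries; a plausible invocation, where the JDBC schema name is an assumption, would be:)
> {noformat}
> # hypothetical invocation; the "parquet-local" schema name is an assumption
> ./bin/sqlline -u jdbc:drill:schema=parquet-local
> {noformat}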
> Then, in Drill I ran:
> {noformat}
> SELECT COUNT(*) FROM "nobench.parquet";
> {noformat}
> And got the following:
> {noformat}
> Caused by: org.apache.drill.exec.rpc.RpcException: Remote failure while running query.[error_id: "a13783d0-d9da-4639-8809-ba4a5ac54e04"
> endpoint {
>   address: "ip-10-101-1-82.ec2.internal"
>   user_port: 31010
>   bit_port: 32011
> }
> error_type: 0
> message: "Failure while running fragment. < NullPointerException"
> ]
>         at org.apache.drill.exec.rpc.user.QueryResultHandler.batchArrived(QueryResultHandler.java:72)
>         at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:79)
>         at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:48)
>         at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:33)
>         at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:142)
>         at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:127)
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>         at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
>         at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>         at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
>         at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:173)
>         at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
>         at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:100)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:497)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:465)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:359)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
> The second time I ran it, I got an OOM:
> {noformat}
> Exception in thread "WorkManager-3" java.lang.OutOfMemoryError: Java heap space
>         at org.apache.drill.exec.store.parquet.PageReadStatus.<init>(PageReadStatus.java:41)
>         at org.apache.drill.exec.store.parquet.ColumnReader.<init>(ColumnReader.java:70)
>         at org.apache.drill.exec.store.parquet.VarLenBinaryReader$NullableVarLengthColumn.<init>(VarLenBinaryReader.java:62)
>         at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:167)
>         at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:99)
>         at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:60)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:103)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:63)
>         at org.apache.drill.exec.store.parquet.ParquetRowGroupScan.accept(ParquetRowGroupScan.java:107)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:90)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:63)
>         at org.apache.drill.exec.physical.config.Project.accept(Project.java:51)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:121)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:63)
>         at org.apache.drill.exec.physical.config.Sort.accept(Sort.java:58)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:151)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:63)
>         at org.apache.drill.exec.physical.config.StreamingAggregate.accept(StreamingAggregate.java:59)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:132)
>         at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:63)
>         at org.apache.drill.exec.physical.config.Screen.accept(Screen.java:102)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:180)
>         at org.apache.drill.exec.work.foreman.RunningFragmentManager.runFragments(RunningFragmentManager.java:84)
>         at org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan(Foreman.java:228)
>         at org.apache.drill.exec.work.foreman.Foreman.parseAndRunLogicalPlan(Foreman.java:176)
>         at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:153)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {noformat}
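> (Not a fix for the NPE, but a possible stop-gap for the secondary OOM would be a larger Drillbit heap; a sketch assuming the {{DRILL_HEAP}} knob from later Drill builds' {{conf/drill-env.sh}}, which may not exist in milestone-1:)
> {noformat}
> # assumption: DRILL_HEAP is honored by this build's launch scripts
> export DRILL_HEAP="4G"
> {noformat}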



--
This message was sent by Atlassian JIRA
(v6.2#6252)
