Date: Sun, 10 Aug 2014 22:08:21 +0000 (UTC)
From: "Jacques Nadeau (JIRA)"
To: issues@drill.incubator.apache.org
Reply-To: dev@drill.incubator.apache.org
Subject: [jira] [Resolved] (DRILL-389) Nested Parquet data generated from Hive does not work

     [ https://issues.apache.org/jira/browse/DRILL-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacques Nadeau resolved DRILL-389.
----------------------------------
    Resolution: Fixed

> Nested Parquet data generated from Hive does not work
> -----------------------------------------------------
>
>                 Key: DRILL-389
>                 URL: https://issues.apache.org/jira/browse/DRILL-389
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: m1
>         Environment: CentOS 6.3
>                      CDH 4.6 installed by Cloudera Manager Free Edition
>                      Hive 0.10.0
>            Reporter: Thaddeus Diamond
>            Assignee: Jason Altekruse
>            Priority: Critical
>             Fix For: 0.5.0
>
>         Attachments: avro_test.db, nobench.ddl, nobench_1.avsc, parquet-nobench_0.parquet
>
>
> Inside of Hive, I generated Parquet data from Avro data as follows. Using the attached Avro file (avro_test.db) and the attached nested Avro schema (nobench_1.avsc), I created a Hive table:
> {noformat}
> CREATE TABLE avro_nobench_hdfs
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION 'hdfs:///user/hdfs/avro'
> TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hdfs/nobench.avsc');
> {noformat}
> Note that this schema is based loosely on the NoBench standard proposed by Craig Chasseur for JSON (http://pages.cs.wisc.edu/~chasseur/).
> In order to create a Parquet Hive table you need to write out a full schema. The one attached is very large, so I used the following:
> {noformat}
> sudo -u hdfs hive -e 'describe avro_nobench_hdfs' > /tmp/temp.sql
> {noformat}
> Then, I replaced the "from deserializer" entries with commas and added the following SQL DDL around the resulting column list:
> {noformat}
> CREATE TABLE avro_nobench_parquet (
>   -- ... COLUMNS HERE
> )
> ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
> STORED AS
> INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
> OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
> {noformat}
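> That substitution could be scripted. A rough sketch, assuming {{describe}} prints one tab-separated "name / type / from deserializer" line per column; the output file name {{/tmp/columns.sql}} is just an illustrative choice, and the trailing comma on the last column still has to be removed by hand:
> {noformat}
> # replace each trailing "from deserializer" marker with a comma
> sed 's/from deserializer/,/' /tmp/temp.sql > /tmp/columns.sql
> {noformat}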
> Finally, I generated the actual Parquet binary data using {{INSERT OVERWRITE}}:
> {noformat}
> INSERT OVERWRITE TABLE avro_nobench_parquet SELECT * FROM avro_nobench_hdfs;
> {noformat}
> This completed successfully. Then, the data was validated using:
> {noformat}
> SELECT COUNT(*) FROM avro_nobench_parquet;
> SELECT * FROM avro_nobench_parquet LIMIT 1;
> {noformat}
> If you look in {{hdfs:///user/hive/warehouse/avro_nobench_parquet}} you'll see a single raw file (something like {{0000_0}}). Download that to local:
> {noformat}
> sudo -u hdfs hdfs dfs -copyToLocal /user/hive/warehouse/avro_nobench_parquet/* .
> {noformat}
> Then, in Drill I ran:
> {noformat}
> SELECT COUNT(*) FROM "nobench.parquet";
> {noformat}
> And got the following:
> {noformat}
> Caused by: org.apache.drill.exec.rpc.RpcException: Remote failure while running query.[error_id: "a13783d0-d9da-4639-8809-ba4a5ac54e04"
> endpoint {
>   address: "ip-10-101-1-82.ec2.internal"
>   user_port: 31010
>   bit_port: 32011
> }
> error_type: 0
> message: "Failure while running fragment. < NullPointerException"
> ]
>     at org.apache.drill.exec.rpc.user.QueryResultHandler.batchArrived(QueryResultHandler.java:72)
>     at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:79)
>     at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:48)
>     at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:33)
>     at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:142)
>     at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:127)
>     at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>     at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
>     at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
>     at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>     at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
>     at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
>     at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:173)
>     at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
>     at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:100)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:497)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:465)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:359)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
>     at java.lang.Thread.run(Thread.java:744)
> {noformat}
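> (Diagnostic aside: since the failure is presumably tied to the nested schema, the file's Parquet schema can be dumped with parquet-tools for comparison. A sketch only, assuming a parquet-tools build is on the PATH; substitute the actual downloaded file name, which is the attached sample here:)
> {noformat}
> # print the Parquet schema of the file Drill failed to read
> parquet-tools schema parquet-nobench_0.parquet
> {noformat}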
> The second time I run it I get an OOM:
> {noformat}
> Exception in thread "WorkManager-3" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.drill.exec.store.parquet.PageReadStatus.<init>(PageReadStatus.java:41)
>     at org.apache.drill.exec.store.parquet.ColumnReader.<init>(ColumnReader.java:70)
>     at org.apache.drill.exec.store.parquet.VarLenBinaryReader$NullableVarLengthColumn.<init>(VarLenBinaryReader.java:62)
>     at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:167)
>     at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:99)
>     at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:60)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:103)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:63)
>     at org.apache.drill.exec.store.parquet.ParquetRowGroupScan.accept(ParquetRowGroupScan.java:107)
>     at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:90)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:63)
>     at org.apache.drill.exec.physical.config.Project.accept(Project.java:51)
>     at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:121)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:63)
>     at org.apache.drill.exec.physical.config.Sort.accept(Sort.java:58)
>     at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:151)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:63)
>     at org.apache.drill.exec.physical.config.StreamingAggregate.accept(StreamingAggregate.java:59)
>     at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:132)
>     at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:63)
>     at org.apache.drill.exec.physical.config.Screen.accept(Screen.java:102)
>     at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:180)
>     at org.apache.drill.exec.work.foreman.RunningFragmentManager.runFragments(RunningFragmentManager.java:84)
>     at org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan(Foreman.java:228)
>     at org.apache.drill.exec.work.foreman.Foreman.parseAndRunLogicalPlan(Foreman.java:176)
>     at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:153)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {noformat}
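> (Workaround aside for the OOM on re-run: giving the Drillbit JVM more heap before retrying may help. A sketch only; the variable names below are those found in conf/drill-env.sh in Drill builds of this era and may differ by release, and the values are illustrative, not tuned:)
> {noformat}
> # conf/drill-env.sh -- raise Drillbit heap and direct memory before restart
> DRILL_MAX_HEAP="8G"
> DRILL_MAX_DIRECT_MEMORY="16G"
> {noformat}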