Date: Tue, 11 Nov 2014 13:07:32 -0800
Subject: Hive Parquet Reader and "repeated" field
From: Jean-Pascal Billaud
To: dev@hive.apache.org, dev@parquet.incubator.apache.org

Hi,

I am trying to integrate Parquet as the underlying storage format in our data pipeline, but I am facing some issues which I hope some of you can help me with. The batch layer is fairly standard: some Cascading jobs write Thrift log objects from an input tap to a Parquet output sink. Here is a snippet of one of the Thrift structures being serialized:

    struct RequestInfo {
      1: optional string status,
      2: optional list<RequestDetails> requests,
    }

    struct RequestDetails {
      1: optional string type,
      2: optional bool valid,
    }

Looking at the Cascading Parquet writer, this translates into:

    optional binary status (UTF8);
    optional group requests (LIST) {
      repeated group requests_tuple {
        optional binary type (UTF8);
        optional boolean valid;
      }
    }

Then I have a Hive table that points to the Parquet file while specifying the serialized Thrift class.

    CREATE EXTERNAL TABLE parquet_requests
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'hdfs_somewhere'
    TBLPROPERTIES (
      'thrift.class' = 'RequestInfo'
    );

While running "select * from parquet_requests", the whole thing crashes with the exception thrown here:

> public ArrayWritableGroupConverter(final GroupType groupType, final HiveGroupConverter parent,
>     final int index) {
>   this.parent = parent;
>   this.index = index;
>   int count = groupType.getFieldCount();
>   if (count < 1 || count > 2) {
>     throw new IllegalStateException("Field count must be either 1 or 2: " + count);
>   }

What this means is that requests_tuple is not considered a valid list because it has more than one field. The reader basically expects the "repeated" keyword on "requests (LIST)" rather than on "requests_tuple".

The current code also does not seem to handle "repeated" on primitives, since the ETypeConverters always call parent.set() and therefore always replace the previously stored instance.

I cooked up a patch which, as far as I can tell, would fix the issues here. I would like some comments on whether the patch is going in the right direction before I submit a more formal pull request. Things still need to be polished, so please don't spend too much time on the form but rather on the approach.

https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d

Moreover, I have a feeling that I should probably not pass the Thrift class for the Parquet table, given that at this point it is totally irrelevant and the Parquet schema is stored in the Parquet files. I also expect some ObjectInspector issues due to the extra grouping introduced by the requests_tuple entry.

Thoughts?

Thanks,
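
P.S. To make the point about repeated primitives concrete, here is a minimal, self-contained sketch of the overwrite behaviour. The classes OverwritingParent and AppendingParent are hypothetical stand-ins invented for this illustration; they are not Hive classes and this is not the code from the patch. The sketch only assumes that a converter reports each repeated value to its parent under the same field index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Hive classes.
final class RepeatedFieldSketch {

    // Parent that keeps a single slot per field index,
    // so each set() replaces the previously stored value.
    static final class OverwritingParent {
        private final Object[] slots;

        OverwritingParent(int fieldCount) {
            this.slots = new Object[fieldCount];
        }

        void set(int index, Object value) {
            slots[index] = value; // last write wins
        }

        Object get(int index) {
            return slots[index];
        }
    }

    // Parent that keeps a list per field index,
    // so values of a repeated field accumulate instead of being replaced.
    static final class AppendingParent {
        private final List<List<Object>> slots = new ArrayList<>();

        AppendingParent(int fieldCount) {
            for (int i = 0; i < fieldCount; i++) {
                slots.add(new ArrayList<>());
            }
        }

        void add(int index, Object value) {
            slots.get(index).add(value);
        }

        List<Object> get(int index) {
            return slots.get(index);
        }
    }

    public static void main(String[] args) {
        // Three values of one repeated primitive field, all reported at index 0.
        String[] repeatedValues = {"GET", "POST", "PUT"};

        OverwritingParent overwriting = new OverwritingParent(1);
        AppendingParent appending = new AppendingParent(1);
        for (String value : repeatedValues) {
            overwriting.set(0, value);
            appending.add(0, value);
        }

        System.out.println("overwriting parent kept: " + overwriting.get(0));
        System.out.println("appending parent kept:   " + appending.get(0));
    }
}
```

With a set()-style parent only the last value of the repeated field survives, whereas an accumulating parent keeps all of them. Whether accumulation belongs in the parent or in the primitive converter itself is a design choice this sketch does not try to settle.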