From: "David Chen (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Date: Thu, 28 Aug 2014 22:51:08 +0000 (UTC)
Subject: [jira] [Commented] (HIVE-4329) HCatalog should use getHiveRecordWriter rather than getRecordWriter

    [ https://issues.apache.org/jira/browse/HIVE-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114530#comment-14114530 ]

David Chen commented on HIVE-4329:
----------------------------------

Hi Sushanth,

I really appreciate you taking the time to look at this patch, and thank you for your tips. However, I am still a bit unclear about some of the concerns you mentioned.

bq. Unfortunately, this will not work, because that simply fetches a substitute HiveOutputFormat from a map of substitutes, which contain substitutes for only IgnoreKeyTextOutputFormat and SequenceFileOutputFormat.

From my understanding, {{HivePassThroughOutputFormat}} was introduced in order to support generic OutputFormats and not just {{HiveOutputFormat}}. According to {{[HiveFileFormatUtils.getOutputFormatSubstitute|https://github.com/apache/hive/blob/b8250ac2f30539f6b23ce80a20a9e338d3d31458/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java]}}, {{HivePassThroughOutputFormat}} is returned when the {{OutputFormat}} does not exist in the map, but only if the method is called with {{storageHandlerFlag = true}}. From [searching the codebase|https://github.com/apache/hive/search?utf8=%E2%9C%93&q=getOutputFormatSubstitute&type=Code], the only place where {{getOutputFormatSubstitute}} can be called with {{storageHandlerFlag}} set to true is {{Table.getOutputFormatClass}}, and only when the {{storage_handler}} property is set. As a result, I changed my patch to retrieve the {{OutputFormat}} class using {{Table.getOutputFormatClass}} so that HCatalog follows the same code path as Hive proper for getting the {{OutputFormat}}. Does this address your concern?

bq. If your patch were so that it fetches an underlying HiveOutputFormat, and if it were a HiveOutputFormat, using getHiveRecordWriter, and if it were not, using getRecordWriter, that solution would not break runtime backward compatibility, and would be acceptable

I tried this approach, but I think that it is cleaner to change {{OutputFormatContainer}} and {{RecordWriterContainer}} to wrap the Hive implementations ({{HiveOutputFormat}} and {{FileSinkOperator.RecordWriter}}) rather than introduce yet another set of wrappers.
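For reference, the dispatch you describe could be sketched roughly as below. All types here are minimal, hypothetical stand-ins (not Hive's real {{HiveOutputFormat}} or the MapReduce {{OutputFormat}} interfaces, whose real methods take a {{JobConf}}, output {{Path}}, and so on); only the instanceof-based fallback logic is the point:

```java
// Self-contained sketch of "prefer getHiveRecordWriter when the wrapped
// OutputFormat is a HiveOutputFormat, else fall back to getRecordWriter".
// All interfaces/classes below are hypothetical stand-ins, not Hive's own.
public class DispatchSketch {
    interface RecordWriter { String describe(); }

    // Stand-in for the plain MapReduce OutputFormat contract.
    interface MROutputFormat {
        RecordWriter getRecordWriter();
    }

    // Stand-in for HiveOutputFormat, which extends the MapReduce interface
    // and adds a Hive-specific entry point.
    interface HiveStyleOutputFormat extends MROutputFormat {
        RecordWriter getHiveRecordWriter();
    }

    // The backward-compatible dispatch: Hive path when available, MR otherwise.
    static RecordWriter writerFor(MROutputFormat of) {
        if (of instanceof HiveStyleOutputFormat) {
            return ((HiveStyleOutputFormat) of).getHiveRecordWriter();
        }
        return of.getRecordWriter();
    }

    static class PlainOF implements MROutputFormat {
        public RecordWriter getRecordWriter() { return () -> "mr"; }
    }

    static class HiveOF implements HiveStyleOutputFormat {
        public RecordWriter getRecordWriter() { return () -> "mr"; }
        public RecordWriter getHiveRecordWriter() { return () -> "hive"; }
    }

    public static void main(String[] args) {
        System.out.println(writerFor(new PlainOF()).describe());  // mr
        System.out.println(writerFor(new HiveOF()).describe());   // hive
    }
}
```

The container-based approach in my patch avoids this per-call branching by deciding once, at wrapping time, which implementation backs the container.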
After all, Hive already has a mechanism for supporting both Hive OFs and MR OFs by wrapping MR OFs with {{HivePassThroughOutputFormat}}, and I think that HCatalog should evolve to share more common infrastructure with Hive.

I have attached a new revision of my patch that fixes the original reason this ticket was opened: writing to an Avro table via HCatalog now works. There are still a few remaining issues, though:
* The way that tables with static partitioning are handled is not completely correct. I have opened HIVE-7855 to address that issue.
* Writing to a Parquet table does not work, but more investigation is needed to determine whether this is caused by a bug in HCatalog or in the Parquet SerDe.


> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>
>                 Key: HIVE-4329
>                 URL: https://issues.apache.org/jira/browse/HIVE-4329
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>        Environment: discovered in Pig, but it looks like the root cause impacts all non-Hive users
>           Reporter: Sean Busbey
>           Assignee: David Chen
>       Attachments: HIVE-4329.0.patch, HIVE-4329.1.patch, HIVE-4329.2.patch
>
>
> Attempting to write to an HCatalog-defined table backed by the AvroSerde fails with the following stack trace:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.io.LongWritable
>     at org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat$1.write(AvroContainerOutputFormat.java:84)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:253)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:53)
>     at org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:242)
>     at org.apache.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:52)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:559)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> {code}
> The proximal cause of this failure is that the AvroContainerOutputFormat's signature mandates a LongWritable key and HCat's FileRecordWriterContainer forces a NullWritable. I'm not sure of a general fix, other than redefining HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting WritableComparable is what's done in the other Hive OutputFormats, and there's no reason AvroContainerOutputFormat couldn't also be changed, since it's ignoring the key. That way, fixing things so FileRecordWriterContainer can always use NullWritable could get spun into a different issue?
> The underlying cause of the failure to write to AvroSerde tables is that AvroContainerOutputFormat doesn't meaningfully implement getRecordWriter, so fixing the above will just push the failure into the placeholder RecordWriter.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
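For anyone skimming the stack trace above, the type mismatch it describes boils down to a few lines. The classes below are hypothetical stand-ins for Hadoop's Writable types (not the real ones), but the failure mechanics are the same: the Avro writer casts whatever key arrives to a LongWritable, while HCat's FileRecordWriterContainer always hands it a NullWritable:

```java
// Minimal reproduction of the ClassCastException pattern, using stand-in
// classes rather than Hadoop's real org.apache.hadoop.io.* types.
public class CastFailureSketch {
    static class Writable {}
    static class LongWritable extends Writable {}
    static class NullWritable extends Writable {}

    // Mirrors the failing cast inside AvroContainerOutputFormat's anonymous
    // RecordWriter: the write path assumes the key is a LongWritable.
    static void avroStyleWrite(Writable key) {
        LongWritable k = (LongWritable) key;  // throws for a NullWritable key
    }

    public static void main(String[] args) {
        try {
            // FileRecordWriterContainer forces a NullWritable key, so the
            // downcast above fails at runtime, not at compile time.
            avroStyleWrite(new NullWritable());
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the stack trace above");
        }
    }
}
```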