From: "David Chen (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Date: Thu, 28 Aug 2014 22:51:08 +0000 (UTC)
Subject: [jira] [Commented] (HIVE-4329) HCatalog should use getHiveRecordWriter rather than getRecordWriter

    [ https://issues.apache.org/jira/browse/HIVE-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114530#comment-14114530 ]

David Chen commented on HIVE-4329:
----------------------------------

Hi Sushanth,

I really appreciate you taking the time to look at this patch, and thank you for your tips. However, I am still a bit unclear about some of the concerns you mentioned.

bq. Unfortunately, this will not work, because that simply fetches a substitute HiveOutputFormat from a map of substitutes, which contain substitutes for only IgnoreKeyTextOutputFormat and SequenceFileOutputFormat.

From my understanding, {{HivePassThroughOutputFormat}} was introduced in order to support generic OutputFormats and not just {{HiveOutputFormat}}. According to {{[HiveFileFormatUtils.getOutputFormatSubstitute|https://github.com/apache/hive/blob/b8250ac2f30539f6b23ce80a20a9e338d3d31458/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java]}}, {{HivePassThroughOutputFormat}} is returned when the {{OutputFormat}} does not exist in the map, but only if the method is called with {{storageHandlerFlag = true}}. From [searching the codebase|https://github.com/apache/hive/search?utf8=%E2%9C%93&q=getOutputFormatSubstitute&type=Code], the only place where {{getOutputFormatSubstitute}} can be called with {{storageHandlerFlag}} set to true is {{Table.getOutputFormatClass}}, and only when the {{storage_handler}} property is set. As a result, I changed my patch to retrieve the {{OutputFormat}} class using {{Table.getOutputFormatClass}} so that HCatalog follows the same code path as Hive proper for getting the {{OutputFormat}}. Does this address your concern?

bq. If your patch were so that it fetches an underlying HiveOutputFormat, and if it were a HiveOutputFormat, using getHiveRecordWriter, and if it were not, using getRecordWriter, that solution would not break runtime backward compatibility, and would be acceptable

I tried this approach, but I think that it is cleaner to change {{OutputFormatContainer}} and {{RecordWriterContainer}} to wrap the Hive implementations ({{HiveOutputFormat}} and {{FileSinkOperator.RecordWriter}}) rather than introduce yet another set of wrappers.
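For reference, the dispatch you describe could be sketched roughly as below. All types here are minimal, hypothetical stand-ins (not Hive's real {{HiveOutputFormat}} or the MapReduce {{OutputFormat}} interfaces, whose real methods take a {{JobConf}}, output {{Path}}, and so on); only the instanceof-based fallback logic is the point:

```java
// Self-contained sketch of "prefer getHiveRecordWriter when the wrapped
// OutputFormat is a HiveOutputFormat, else fall back to getRecordWriter".
// All interfaces/classes below are hypothetical stand-ins, not Hive's own.
public class DispatchSketch {
    interface RecordWriter { String describe(); }

    // Stand-in for the plain MapReduce OutputFormat contract.
    interface MROutputFormat {
        RecordWriter getRecordWriter();
    }

    // Stand-in for HiveOutputFormat, which extends the MapReduce interface
    // and adds a Hive-specific entry point.
    interface HiveStyleOutputFormat extends MROutputFormat {
        RecordWriter getHiveRecordWriter();
    }

    // The backward-compatible dispatch: Hive path when available, MR otherwise.
    static RecordWriter writerFor(MROutputFormat of) {
        if (of instanceof HiveStyleOutputFormat) {
            return ((HiveStyleOutputFormat) of).getHiveRecordWriter();
        }
        return of.getRecordWriter();
    }

    static class PlainOF implements MROutputFormat {
        public RecordWriter getRecordWriter() { return () -> "mr"; }
    }

    static class HiveOF implements HiveStyleOutputFormat {
        public RecordWriter getRecordWriter() { return () -> "mr"; }
        public RecordWriter getHiveRecordWriter() { return () -> "hive"; }
    }

    public static void main(String[] args) {
        System.out.println(writerFor(new PlainOF()).describe());  // mr
        System.out.println(writerFor(new HiveOF()).describe());   // hive
    }
}
```

The container-based approach in my patch avoids this per-call branching by deciding once, at wrapping time, which implementation backs the container.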
After all, Hive already has a mechanism for supporting both Hive OFs and MR OFs by wrapping MR OFs with {{HivePassThroughOutputFormat}}, and I think that HCatalog should evolve to share more common infrastructure with Hive.

I have attached a new revision of my patch that fixes the original reason this ticket was opened: writing to an Avro table via HCatalog now works. There are still a few remaining issues, though:
* The way that tables with static partitioning are handled is not completely correct. I have opened HIVE-7855 to address that issue.
* Writing to a Parquet table does not work, but more investigation is needed to determine whether this is caused by a bug in HCatalog or in the Parquet SerDe.


> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>
>                 Key: HIVE-4329
>                 URL: https://issues.apache.org/jira/browse/HIVE-4329
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>        Environment: discovered in Pig, but it looks like the root cause impacts all non-Hive users
>           Reporter: Sean Busbey
>           Assignee: David Chen
>       Attachments: HIVE-4329.0.patch, HIVE-4329.1.patch, HIVE-4329.2.patch
>
>
> Attempting to write to an HCatalog-defined table backed by the AvroSerde fails with the following stack trace:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.io.LongWritable
>     at org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat$1.write(AvroContainerOutputFormat.java:84)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:253)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:53)
>     at org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:242)
>     at org.apache.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:52)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:559)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> {code}
> The proximal cause of this failure is that the AvroContainerOutputFormat's signature mandates a LongWritable key and HCat's FileRecordWriterContainer forces a NullWritable. I'm not sure of a general fix, other than redefining HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting WritableComparable is what's done in the other Hive OutputFormats, and there's no reason AvroContainerOutputFormat couldn't also be changed, since it's ignoring the key. That way, fixing things so FileRecordWriterContainer can always use NullWritable could get spun into a different issue?
> The underlying cause of the failure to write to AvroSerde tables is that AvroContainerOutputFormat doesn't meaningfully implement getRecordWriter, so fixing the above will just push the failure into the placeholder RecordWriter.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
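For anyone skimming the stack trace above, the type mismatch it describes boils down to a few lines. The classes below are hypothetical stand-ins for Hadoop's Writable types (not the real ones), but the failure mechanics are the same: the Avro writer casts whatever key arrives to a LongWritable, while HCat's FileRecordWriterContainer always hands it a NullWritable:

```java
// Minimal reproduction of the ClassCastException pattern, using stand-in
// classes rather than Hadoop's real org.apache.hadoop.io.* types.
public class CastFailureSketch {
    static class Writable {}
    static class LongWritable extends Writable {}
    static class NullWritable extends Writable {}

    // Mirrors the failing cast inside AvroContainerOutputFormat's anonymous
    // RecordWriter: the write path assumes the key is a LongWritable.
    static void avroStyleWrite(Writable key) {
        LongWritable k = (LongWritable) key;  // throws for a NullWritable key
    }

    public static void main(String[] args) {
        try {
            // FileRecordWriterContainer forces a NullWritable key, so the
            // downcast above fails at runtime, not at compile time.
            avroStyleWrite(new NullWritable());
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the stack trace above");
        }
    }
}
```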