Date: Fri, 15 Aug 2014 22:15:19 +0000 (UTC)
From: "Sushanth Sowmyan (JIRA)"
To: hive-dev@hadoop.apache.org
Subject: [jira] [Commented] (HIVE-4329) HCatalog should use getHiveRecordWriter rather than getRecordWriter

    [ https://issues.apache.org/jira/browse/HIVE-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099257#comment-14099257 ]

Sushanth Sowmyan commented on HIVE-4329:
----------------------------------------

Hi,

I'm against the goal of this patch altogether: it effectively breaks one of the core reasons for HCatalog's existence, which is to be a generic wrapper over underlying mapreduce InputFormats/OutputFormats (IF/OFs) for consumers that expect mapreduce IF/OFs.

I apologize for not having spotted this jira earlier, since it seems a lot of work has gone into it. I understand that there is an impedance mismatch here between HiveOutputFormat and OutputFormat, and one we want to fix, but this fix is in the opposite direction of the desired way of solving that mismatch.

One of our longer-term goals has been to evolve Hive's usage of StorageHandlers to the point where Hive stops using HiveRecordWriter/HiveOutputFormat altogether, so that there is no notion of an "internal" and an "external" OutputFormat definition, and third-party mapreduce IF/OFs can be integrated into Hive directly instead of having to be rewritten against HiveOutputFormat/etc.

The primary issue discussed here, that of FileRecordWriterContainer writing out a NullWritable, is solvable: FileRecordWriterContainer's key type is a WritableComparable, and since AvroContainerOutputFormat does not care about the key anyway, we should simply be ignoring it. If it's simpler, I would also be in favour of a hack such as FileRecordWriterContainer emitting a LongWritable when it detects that it's wrapping an AvroContainerOutputFormat, instead of rewiring HCatalog to be based on HiveOutputFormat.
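To sketch the shape of that hack (illustrative only; the helper class and its wiring below are made up for this comment, not the actual HCatalog code):

{code}
// Illustrative sketch only: a hypothetical helper, not the actual
// HCatalog code. It picks the key to hand to the wrapped RecordWriter
// based on which OutputFormat is being wrapped.
import org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.OutputFormat;

final class WrappedKeyCompat {
  private WrappedKeyCompat() {}

  // AvroContainerOutputFormat's writer casts its key to LongWritable,
  // so hand it a dummy LongWritable; every other wrapped OF keeps the
  // usual NullWritable.
  static WritableComparable<?> keyFor(OutputFormat<?, ?> wrappedOF) {
    if (wrappedOF instanceof AvroContainerOutputFormat) {
      return new LongWritable(0L);
    }
    return NullWritable.get();
  }
}
{code}

FileRecordWriterContainer.write() could then pass the result of keyFor() through to the wrapped writer instead of a hard-coded NullWritable.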
> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>
>                 Key: HIVE-4329
>                 URL: https://issues.apache.org/jira/browse/HIVE-4329
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>        Environment: discovered in Pig, but it looks like the root cause impacts all non-Hive users
>            Reporter: Sean Busbey
>            Assignee: David Chen
>        Attachments: HIVE-4329.0.patch
>
>
> Attempting to write to an HCatalog-defined table backed by the AvroSerde fails with the following stack trace:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.io.LongWritable
>     at org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat$1.write(AvroContainerOutputFormat.java:84)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:253)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:53)
>     at org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:242)
>     at org.apache.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:52)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:559)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> {code}
> The proximal cause of this failure is that AvroContainerOutputFormat's signature mandates a LongWritable key while HCat's FileRecordWriterContainer forces a NullWritable. I'm not sure of a general fix, other than redefining HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting a WritableComparable is what's done in the other Hive OutputFormats, and there's no reason AvroContainerOutputFormat couldn't also be changed, since it ignores the key. That way, fixing things so FileRecordWriterContainer can always use NullWritable could be spun into a separate issue.
> The underlying cause of the failure to write to AvroSerde tables is that AvroContainerOutputFormat doesn't meaningfully implement getRecordWriter, so fixing the above will just push the failure into the placeholder RecordWriter.
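For illustration, the change the quoted description suggests (AvroContainerOutputFormat accepting any WritableComparable and ignoring it) would look roughly like the sketch below; the class name and constructor wiring are hypothetical, not the actual Hive source:

{code}
// Illustrative sketch only, with hypothetical names: a RecordWriter
// keyed on WritableComparable that simply ignores the key, so the
// NullWritable emitted by FileRecordWriterContainer would be harmless.
import java.io.IOException;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

class KeyIgnoringAvroRecordWriter
    implements RecordWriter<WritableComparable<?>, AvroGenericRecordWritable> {

  private final DataFileWriter<GenericRecord> avroWriter;

  KeyIgnoringAvroRecordWriter(DataFileWriter<GenericRecord> avroWriter) {
    this.avroWriter = avroWriter;
  }

  @Override
  public void write(WritableComparable<?> key, AvroGenericRecordWritable value)
      throws IOException {
    // The key is deliberately ignored; only the Avro record is written.
    avroWriter.append(value.getRecord());
  }

  @Override
  public void close(Reporter reporter) throws IOException {
    avroWriter.close();
  }
}
{code}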