hive-dev mailing list archives

From "David Chen (JIRA)" <>
Subject [jira] [Commented] (HIVE-4329) HCatalog should use getHiveRecordWriter rather than getRecordWriter
Date Thu, 28 Aug 2014 22:51:08 GMT


David Chen commented on HIVE-4329:

Hi Sushanth,

I really appreciate you taking the time to look at this patch and for your tips. However,
I am still a bit unclear about some of the concerns you mentioned.

bq. Unfortunately, this will not work, because that simply fetches a substitute HiveOutputFormat
from a map of substitutes, which contains substitutes for only IgnoreKeyTextOutputFormat and …

From my understanding, {{HivePassThroughOutputFormat}} was introduced in order to support
generic OutputFormats and not just {{HiveOutputFormat}}. According to {{HiveFileFormatUtils.getOutputFormatSubstitute}},
{{HivePassThroughOutputFormat}} is returned if the {{OutputFormat}} does not exist in the
map, but only if it is called with {{storageHandlerFlag = true}}. From searching the codebase,
the only place where {{getOutputFormatSubstitute}} could be called with {{storageHandlerFlag}}
set to true is in {{Table.getOutputFormatClass}}, and only if the {{storage_handler}} property is set.
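
For reference, here is the lookup as I read it, paraphrased as a sketch rather than copied from the Hive source ({{outputFormatSubstituteMap}} is the class's static substitute map):
{code}
// Paraphrase of HiveFileFormatUtils.getOutputFormatSubstitute as I read it;
// a sketch, not an exact copy of the source. outputFormatSubstituteMap is the
// class's static map of MR OutputFormat -> HiveOutputFormat substitutes.
static Class<? extends HiveOutputFormat> getOutputFormatSubstitute(
    Class<?> origin, boolean storageHandlerFlag) {
  if (HiveOutputFormat.class.isAssignableFrom(origin)) {
    return (Class<? extends HiveOutputFormat>) origin; // already Hive-native
  }
  Class<? extends HiveOutputFormat> result = outputFormatSubstituteMap.get(origin);
  if (result == null && storageHandlerFlag) {
    // Generic OutputFormats only get the pass-through wrapper on this path.
    result = HivePassThroughOutputFormat.class;
  }
  return result;
}
{code}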

As a result, I changed my patch to retrieve the {{OutputFormat}} class using {{Table.getOutputFormatClass}}
so that HCatalog would follow the same codepath as Hive proper for getting the {{OutputFormat}}.
Does this address your concern?

bq. If your patch were so that it fetches an underlying HiveOutputFormat, and if it were a
HiveOutputFormat, using getHiveRecordWriter, and if it were not, using getRecordWriter, that
solution would not break runtime backward compatibility, and would be acceptable
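
If I am reading that correctly, the suggestion amounts to roughly the following dispatch ({{baseOF}} and the other variable names are illustrative, not from my patch):
{code}
// Sketch of the suggested dispatch; baseOF and the argument variables are
// illustrative stand-ins for whatever the container would have in scope.
if (baseOF instanceof HiveOutputFormat) {
  FileSinkOperator.RecordWriter hiveWriter =
      ((HiveOutputFormat<?, ?>) baseOF).getHiveRecordWriter(
          jobConf, outPath, valueClass, isCompressed, tableProperties, progressable);
  // ... wrap hiveWriter in a RecordWriterContainer ...
} else {
  org.apache.hadoop.mapred.RecordWriter<?, ?> mrWriter =
      baseOF.getRecordWriter(fs, jobConf, fileName, progressable);
  // ... wrap mrWriter in a second, MR-flavored container ...
}
{code}
Note that the two branches hand back unrelated writer interfaces, which is where the extra set of wrappers would come from.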

I tried this approach, but I think that it is cleaner to change {{OutputFormatContainer}}
and {{RecordWriterContainer}} to wrap the Hive implementations ({{HiveOutputFormat}} and {{FileSinkOperator.RecordWriter}})
rather than introduce yet another set of wrappers. After all, Hive already has a mechanism
for supporting both Hive OFs and MR OFs by wrapping MR OFs with {{HivePassThroughOutputFormat}},
and I think that HCatalog should evolve to share more common infrastructure with Hive.
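
Concretely, the single code path I am aiming for looks roughly like this (a sketch, assuming {{Table.getOutputFormatClass}} has already applied the pass-through substitution described above; variable names are illustrative):
{code}
// Sketch: resolve the OutputFormat the way Hive proper does. A plain MR
// format arrives here already wrapped in HivePassThroughOutputFormat, so
// getHiveRecordWriter is always available and one code path suffices.
HiveOutputFormat<?, ?> hiveOF = (HiveOutputFormat<?, ?>)
    ReflectionUtils.newInstance(table.getOutputFormatClass(), jobConf);
FileSinkOperator.RecordWriter writer = hiveOF.getHiveRecordWriter(
    jobConf, outPath, valueClass, isCompressed, tableProperties, progressable);
{code}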

I have attached a new revision of my patch that fixes the original reason this ticket
was opened: writing to an Avro table via HCatalog now works. There are still a few remaining
issues, though:

 * The way tables with static partitioning are handled is not completely correct. I have
opened HIVE-7855 to address that issue.
 * Writing to a Parquet table does not work, but more investigation is needed to determine
whether this is caused by a bug in HCatalog or in the Parquet SerDe.

> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>                 Key: HIVE-4329
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>         Environment: discovered in Pig, but it looks like the root cause impacts all
non-Hive users
>            Reporter: Sean Busbey
>            Assignee: David Chen
>         Attachments: HIVE-4329.0.patch, HIVE-4329.1.patch, HIVE-4329.2.patch
> Attempting to write to an HCatalog-defined table backed by the AvroSerde fails with the
following stacktrace:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.io.LongWritable
> 	at org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat$1.write(...)
> 	at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(...)
> 	at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(...)
> 	at org.apache.hcatalog.pig.HCatBaseStorer.putNext(...)
> 	at org.apache.hcatalog.pig.HCatStorer.putNext(...)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(...)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(...)
> 	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(...)
> 	at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(...)
> {code}
> The proximal cause of this failure is that the AvroContainerOutputFormat's signature
mandates a LongWritable key and HCat's FileRecordWriterContainer forces a NullWritable. I'm
not sure of a general fix, other than redefining HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting WritableComparable is what's done in the other Hive OutputFormats,
and there's no reason AvroContainerOutputFormat couldn't also be changed, since it ignores
the key. That way, fixing things so that FileRecordWriterContainer can always use NullWritable
could be spun out into a separate issue?
> The underlying cause for failure to write to AvroSerde tables is that AvroContainerOutputFormat
doesn't meaningfully implement getRecordWriter, so fixing the above will just push the failure
into the placeholder RecordWriter.
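> To make the key-type mismatch concrete, here is a minimal, standalone illustration (not
HCat or Hive code; {{TypedWriter}} is invented for the demo):
> {code}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Writable;
> 
> // Standalone demo of the failure mode: a writer typed against LongWritable
> // keys is invoked through an erased reference with a NullWritable key.
> public class KeyMismatchDemo {
>   interface TypedWriter<K extends Writable> {
>     void write(K key);
>   }
> 
>   @SuppressWarnings({"rawtypes", "unchecked"})
>   public static void main(String[] args) {
>     TypedWriter<LongWritable> avroStyle = key -> System.out.println(key.get());
>     TypedWriter raw = avroStyle; // what the container effectively sees
>     // Throws ClassCastException: NullWritable cannot be cast to LongWritable,
>     // mirroring the stacktrace above.
>     raw.write(NullWritable.get());
>   }
> }
> {code}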

This message was sent by Atlassian JIRA
