hive-dev mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-4329) HCatalog should use getHiveRecordWriter rather than getRecordWriter
Date Thu, 30 Oct 2014 09:22:33 GMT


Sushanth Sowmyan commented on HIVE-4329:

Despite my initial reservations about the approach, I've been trying to extend this patch and
get it working in time for 0.14, because the functionality it introduces is important. Last week,
I pinged Vikram to get it okayed for 0.14. However, after further review and debugging,
the patch is still incomplete.

The org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.testPigPopulation failure
reported above occurs because this patch does not call FileSinkOperator.checkOutputSpecs. As a
result, "actualOutputFormat" is never populated, and PassthroughOutputFormat thinks its
underlying OutputFormat is null. Simply calling that function is not enough either, since it
depends on the FileSinkOperator having been instantiated with a TableDesc in its context. That,
at least, is fixable, since HCatalog does have access to a TableDesc. HCatalog would then need
to detect whether the underlying OF is a PassthroughOutputFormat and, if so, instantiate
PassthroughOutputFormat appropriately by calling a refactored FileSinkOperator.checkOutputSpecs
that does not require the Operator itself.

As it stands, this patch still breaks the traditional use of M/R OutputFormats under HCatalog.
At this point, I think it's easier to fix the underlying issue of making Avro work with
HCatalog than to rush this patch into the 0.14 timeframe.

(Having said that, PassthroughOutputFormat is itself pretty broken, since it stores the
realoutputFormat as a static string in HiveFileFormatUtils. That breaks existing use cases such
as calling HBase through HS2 and then attempting to use any other M/R OF like Accumulo, since
HS2 is a persistent process that retains the older value of that static variable. It doesn't
break under the hive command line itself, as long as you write to only one M/R-OF-based output
in one query. That is a separate bug that is not this patch's fault, but this patch makes
HCatalog depend on PassthroughOutputFormat, and HCat does get used in multiple-use-per-process
scenarios, so it is affected. I'll file another jira on that issue soon; I've been debugging
it.) We may rely on PassthroughOutputFormat in the short term, but we really need to move off
it and support M/R OFs natively (with native M/R OutputCommitter semantics) in Hive.
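
To make the static-variable problem concrete, here is a minimal sketch of one way a process-wide static can leak state between queries in a long-lived process. Everything in it is a hypothetical stand-in: SharedSlot, the config key, and the format names are placeholders for illustration and are not Hive's actual classes or the exact code path involved.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: these classes are hypothetical stand-ins, not Hive's real code.
class StaticFormatHazard {

    /** Mimics a JVM-wide static slot that remembers the "real" output format name. */
    static final class SharedSlot {
        static String realOutputFormat; // one value shared by every query in the process
    }

    static final String KEY = "example.real.output.format"; // hypothetical config key

    /** Returns the format name a second query sees when it relies on the static slot. */
    static String secondQueryViaStatic() {
        SharedSlot.realOutputFormat = null;                 // fresh process
        SharedSlot.realOutputFormat = "HBaseOutputFormat";  // query 1 records its format
        // Query 2 wants "AccumuloOutputFormat", but in this sketch nothing on its
        // code path refreshes the static, so it reads query 1's leftover value.
        return SharedSlot.realOutputFormat;
    }

    /** Returns the format name a second query sees when each job carries its own settings. */
    static String secondQueryViaPerJobConf() {
        Map<String, String> job1 = new HashMap<>();
        job1.put(KEY, "HBaseOutputFormat");
        Map<String, String> job2 = new HashMap<>();
        job2.put(KEY, "AccumuloOutputFormat");
        return job2.get(KEY); // unaffected by job1
    }

    public static void main(String[] args) {
        System.out.println("static slot:  " + secondQueryViaStatic());
        System.out.println("per-job conf: " + secondQueryViaPerJobConf());
    }
}
```

In a short-lived process that writes a single output, the static slot is set once and read once, which would be consistent with the command line case not showing the problem. Keeping the value in per-job state is one possible direction, not necessarily the fix that will be chosen.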

> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>                 Key: HIVE-4329
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>         Environment: discovered in Pig, but it looks like the root cause impacts all
>                      non-Hive users
>            Reporter: Sean Busbey
>            Assignee: David Chen
>            Priority: Critical
>             Fix For: 0.14.0
>         Attachments: HIVE-4329.0.patch, HIVE-4329.1.patch, HIVE-4329.2.patch, HIVE-4329.3.patch,
>                      HIVE-4329.4.patch, HIVE-4329.5.patch
> Attempting to write to a HCatalog defined table backed by the AvroSerde fails with the
> following stacktrace:
> {code}
> java.lang.ClassCastException: cannot be cast to
> 	at$1.write(
> 	at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(
> 	at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(
> 	at org.apache.hcatalog.pig.HCatBaseStorer.putNext(
> 	at org.apache.hcatalog.pig.HCatStorer.putNext(
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(
> 	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(
> 	at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(
> {code}
> The proximal cause of this failure is that the AvroContainerOutputFormat's signature
> mandates a LongWritable key and HCat's FileRecordWriterContainer forces a NullWritable. I'm
> not sure of a general fix, other than redefining HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting WritableComparable is what's done in the other Hive OutputFormats,
> and there's no reason AvroContainerOutputFormat couldn't also be changed, since it's ignoring
> the key. That way fixing things so FileRecordWriterContainer can always use NullWritable could
> get spun into a different issue?
> The underlying cause for failure to write to AvroSerde tables is that AvroContainerOutputFormat
> doesn't meaningfully implement getRecordWriter, so fixing the above will just push the failure
> into the placeholder RecordWriter.
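
The key-type mismatch described in the quoted issue can be reproduced in miniature. The sketch below uses hypothetical stand-in types (the nested NullWritable, LongWritable, WritableComparable, and RecordWriter here are local placeholders, not the Hadoop classes) to show why a writer that declares a LongWritable key fails with a ClassCastException when a container hands it a NullWritable through an untyped reference, while a writer that accepts any WritableComparable does not.

```java
// Illustrative only: all types below are local stand-ins, not Hadoop or Hive classes.
class KeyTypeMismatch {

    interface WritableComparable {}
    static final class NullWritable implements WritableComparable {}
    static final class LongWritable implements WritableComparable {}

    interface RecordWriter<K> {
        void write(K key, String value);
    }

    /** Declares a LongWritable key even though it never looks at the key. */
    static final class StrictWriter implements RecordWriter<LongWritable> {
        @Override
        public void write(LongWritable key, String value) {
            // key is ignored; only the value would be written
        }
    }

    /** Accepts any WritableComparable key and ignores it. */
    static final class LenientWriter implements RecordWriter<WritableComparable> {
        @Override
        public void write(WritableComparable key, String value) {
            // key is ignored; only the value would be written
        }
    }

    /**
     * Mimics a container that always passes a NullWritable key through a raw
     * (untyped) writer reference. Returns true if the write succeeded.
     */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static boolean writeWithNullKey(RecordWriter writer) {
        try {
            writer.write(new NullWritable(), "row");
            return true;
        } catch (ClassCastException e) {
            // The compiler-generated bridge method casts the key to the declared type.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("strict writer ok:  " + writeWithNullKey(new StrictWriter()));
        System.out.println("lenient writer ok: " + writeWithNullKey(new LenientWriter()));
    }
}
```

Widening the declared key type only removes the cast failure. As the description notes, the real write path would still need a meaningful getRecordWriter implementation behind it.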

This message was sent by Atlassian JIRA
