hive-dev mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-4765) Improve HBase bulk loading facility
Date Thu, 31 Jul 2014 22:17:39 GMT


Sushanth Sowmyan commented on HIVE-4765:

[~navis], this patch is an exciting one for me, because I've long wanted to work on introducing
OutputCommitter semantics into hive. And given that we've wanted to revamp the hbase bulk
load as well for a while, this is a double-win for me.

That said, I do have a few thoughts on the introduction of the HiveOutputCommitter.

a) I like that you added a completed() along with the commit(), which allows signalling the
end of the commit process. This is a good addition. I would also have liked some way to
add a failed() or equivalent, so that we can signal that something on our end failed, say
while moving files.

b) One of my pet peeves with HiveOutputFormat in general is the impedance mismatches in RecordWriter
vs. HiveRecordWriter, and the lack of an OutputCommitter has meant that generic OutputFormats
would need to be ported over to Hive, or developed completely within hive, rather than being
usable as-is. Thus, one of my major goals for introducing an OutputCommitter semantic would
be to reduce that mismatch, and move hive towards being able to consume a generic M/R IF /
OF with no additional work. To this end, I'm a little wary of introducing a HiveOutputCommitter
that has a similar mismatch, one that would later need to be "fixed" the way HiveRecordWriter
does. If people implement the interface as currently introduced, we would then have to
break them to clean up the interface.

c) I would prefer HiveOutputFormat to have a method to create/return an output committer (with
a default impl returning null), rather than extend HiveOutputCommitter. This matches the M/R
form more closely and will make it easier to bridge that gap, I think.
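
A minimal sketch of the shape suggested in (a) and (c) might look like the following. All
names here (SketchOutputCommitter, SketchOutputFormat, CommitDriver) are hypothetical and are
not taken from the patch or from Hive; the Java 8 default method is only a compact way to
show "a default impl returning null", and an abstract base class would serve equally well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical committer contract; names are illustrative, not from the HIVE-4765 patch.
interface SketchOutputCommitter {
    void commit() throws Exception;   // publish the job output, e.g. by moving files
    void completed();                 // signalled after commit() has succeeded
    void failed(Exception cause);     // suggested in (a): signalled when committing fails
}

// Hypothetical output format: it hands out a committer instead of being one, as in (c).
interface SketchOutputFormat {
    // Default: no committer, mirroring "a default impl returning null".
    default SketchOutputCommitter getOutputCommitter() {
        return null;
    }
}

// Hypothetical driver showing how a caller could route success and failure signals.
final class CommitDriver {
    static List<String> run(SketchOutputFormat format) {
        List<String> events = new ArrayList<>();
        SketchOutputCommitter committer = format.getOutputCommitter();
        if (committer == null) {
            // Formats that do not supply a committer keep today's behaviour.
            events.add("no-committer");
            return events;
        }
        try {
            committer.commit();
            events.add("committed");
            committer.completed();
            events.add("completed");
        } catch (Exception e) {
            // Any failure during commit (or completion) is reported back to the committer.
            committer.failed(e);
            events.add("failed");
        }
        return events;
    }
}
```

Under these assumptions, an output format that does not override getOutputCommitter() is
unaffected, while one that does gets commit(), completed() and failed() callbacks from the
driver. This is only one possible arrangement, not a proposal for the final interface.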

Also, if there was any particular reason you intentionally avoided the M/R Committer idiom,
I'd be happy to hear that as well, and we can think on how to create a generic M/R storage
handler to wrap generic M/R IF/OFs if need be.

> Improve HBase bulk loading facility
> -----------------------------------
>                 Key: HIVE-4765
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: HIVE-4765.2.patch.txt, HIVE-4765.3.patch.txt, HIVE-4765.D11463.1.patch
> With some patches, bulk loading process for HBase could be simplified a lot.
> {noformat}
> CREATE EXTERNAL TABLE hbase_export(rowkey STRING, col1 STRING, col2 STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseExportSerDe'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:key,cf2:value")
>   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileExporter'
> LOCATION '/tmp/export';
> SET mapred.reduce.tasks=4;
> set hive.optimize.sampling.orderby=true;
> SELECT * from (SELECT union_kv(key,key,value,":key,cf1:key,cf2:value") as (rowkey,union) FROM src) A ORDER BY rowkey,union;
> hive> !hadoop fs -lsr /tmp/export;                                               
> drwxr-xr-x   - navis supergroup          0 2013-06-20 11:05 /tmp/export/cf1
> -rw-r--r--   1 navis supergroup       4317 2013-06-20 11:05 /tmp/export/cf1/384abe795e1a471cac6d3770ee38e835
> -rw-r--r--   1 navis supergroup       5868 2013-06-20 11:05 /tmp/export/cf1/b8b6d746c48f4d12a4cf1a2077a28a2d
> -rw-r--r--   1 navis supergroup       5214 2013-06-20 11:05 /tmp/export/cf1/c8be8117a1734bd68a74338dfc4180f8
> -rw-r--r--   1 navis supergroup       4290 2013-06-20 11:05 /tmp/export/cf1/ce41f5b1cfdc4722be25207fc59a9f10
> drwxr-xr-x   - navis supergroup          0 2013-06-20 11:05 /tmp/export/cf2
> -rw-r--r--   1 navis supergroup       6744 2013-06-20 11:05 /tmp/export/cf2/409673b517d94e16920e445d07710f52
> -rw-r--r--   1 navis supergroup       4975 2013-06-20 11:05 /tmp/export/cf2/96af002a6b9f4ebd976ecd83c99c8d7e
> -rw-r--r--   1 navis supergroup       6096 2013-06-20 11:05 /tmp/export/cf2/c4f696587c5e42ee9341d476876a3db4
> -rw-r--r--   1 navis supergroup       4890 2013-06-20 11:05 /tmp/export/cf2/fd9adc9e982f4fe38c8d62f9a44854ba
> hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/export test
> {noformat}
