orc-user mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: Questions regarding hive --orcfiledump or exporting orcfiles
Date Mon, 29 Jan 2018 22:28:48 GMT
There are some details, but fundamentally yes.

For non-partitioned tables, I'd probably distcp somewhere else and then use:

hive> load data inpath 'hdfs:/staging/' overwrite into table MyTable;

Assuming the table is backed by the same HDFS, it will do an hdfs mv to
place the data in the right directory.

Partitioned tables would need to be done with a load data command per
partition.
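
As a rough sketch of the per-partition loads for a partitioned table, the
statements could be generated with a small script rather than typed by hand.
Everything specific here is an assumption for illustration: the helper name
load_statements, a single string-valued partition column called ds, and a
staging directory laid out as ds=<value> subdirectories.

```python
def load_statements(table, staging_root, partition_col, values):
    """Build one LOAD DATA statement per partition value.

    Assumes (hypothetically) that each partition was copied to
    <staging_root>/<partition_col>=<value> and that the partition
    column is string-valued, so the value is quoted.
    """
    root = staging_root.rstrip("/")
    stmts = []
    for v in values:
        path = f"{root}/{partition_col}={v}"
        stmts.append(
            f"LOAD DATA INPATH '{path}' OVERWRITE INTO TABLE {table} "
            f"PARTITION ({partition_col}='{v}');"
        )
    return stmts


if __name__ == "__main__":
    # Table name, column name, and dates are placeholders.
    for stmt in load_statements("MyTable", "hdfs:/staging/", "ds",
                                ["2018-01-28", "2018-01-29"]):
        print(stmt)
```

The printed statements could then be saved to a file and run with hive -f.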


If you are using ACID tables, it is more complicated.

.. Owen

On Mon, Jan 29, 2018 at 2:12 PM, Colin Williams <
colin.williams.seattle@gmail.com> wrote:

> Hi Owen,
> New to hive.
> Is the process as easy as
> # Export database
> 1) distcp a database hdfs path to s3
> # Import database
> 2) distcp the s3 database path back to an hdfs path
> 3) use CREATE TABLE statement from hive and set LOCATION as hdfs path?
> On Mon, Jan 29, 2018 at 1:55 PM, Owen O'Malley <owen.omalley@gmail.com>
> wrote:
>> My guess is that you should be able to save a fair amount of time by
>> doing a byte copy rather than rewriting the ORC file.
>> To get a distributed copy, you'd probably want to use distcp and then
>> create the necessary tables and partitions for your Hive metastore.
>> .. Owen
>> On Mon, Jan 29, 2018 at 1:16 PM, Colin Williams <
>> colin.williams.seattle@gmail.com> wrote:
>>> Hello,
>>> Wasn't sure if I should ask here or on the Hive mailing list. We're
>>> creating External tables from an S3 bucket that contains some textfile
>>> records. Then we import these tables with STORED AS ORC.
>>> We have about 20 tables, and it takes a couple hours to create the
>>> tables. However currently we are just using a static data set.
>>> Then I'm wondering can I reduce the load time by exporting the tables
>>> using hive --orcfiledump or just copying the files from HDFS into an S3
>>> bucket. And then load into HDFS again? Will this likely save me a bit of
>>> load time?
>>> Best,
>>> Colin Williams
