From: "Owen O'Malley"
Date: Mon, 29 Jan 2018 14:28:48 -0800
Subject: Re: Questions regarding hive --orcfiledump or exporting orcfiles
To: user@orc.apache.org

There are some details, but fundamentally yes.

For non-partitioned tables, I'd probably distcp somewhere else and then use:

hive> load data inpath 'hdfs:/staging/' overwrite into table MyTable;

Assuming the table is backed by the same HDFS, it will do a hdfs mv to place the data in the right directory.
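
For example, the whole round trip could look roughly like this (the bucket name and the staging/warehouse paths below are just placeholders):

  # export: copy the table's files out of HDFS to S3
  hadoop distcp hdfs:/warehouse/mytable s3a://my-bucket/backup/mytable
  # import: copy them back into a staging directory, then hand them to Hive
  hadoop distcp s3a://my-bucket/backup/mytable hdfs:/staging/mytable
  hive> load data inpath 'hdfs:/staging/mytable' overwrite into table MyTable;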

Partitioned tables would need to be done with a load data command per partition.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
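
For a partitioned table it would be one command along these lines per partition (the staging path and the ds partition column here are made up for illustration):

  hive> load data inpath 'hdfs:/staging/mytable/ds=2018-01-29'
            overwrite into table MyTable partition (ds='2018-01-29');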

If you are using ACID tables, it is more complicated.

.. Owen

On Mon, Jan 29, 2018 at 2:12 PM, Colin Williams <colin.williams.seattle@gmail.com> wrote:

> Hi Owen,
>
> New to hive.
>
> Is the process as easy as
>
> # Export database
> 1) distcp a database hdfs path to s3
>
> # Import database
> 2) distcp the s3 database path back to an hdfs path
> 3) use CREATE TABLE statement from hive and set LOCATION as hdfs path?
>
> On Mon, Jan 29, 2018 at 1:55 PM, Owen O'Malley <owen.omalley@gmail.com> wrote:
>
>> My guess is that you should be able to save a fair amount of time by doing a byte copy rather than rewriting the ORC file.
>>
>> To get a distributed copy, you'd probably want to use distcp and then create the necessary tables and partitions for your Hive metastore.
>>
>> .. Owen
>>
>> On Mon, Jan 29, 2018 at 1:16 PM, Colin Williams <colin.williams.seattle@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Wasn't sure if I should ask here or on the Hive mailing list. We're creating External tables from an S3 bucket that contains some textfile records. Then we import these tables with STORED AS ORC.
>>>
>>> We have about 20 tables, and it takes a couple hours to create the tables. However currently we are just using a static data set.
>>>
>>> Then I'm wondering: can I reduce the load time by exporting the tables using hive --orcfiledump, or by just copying the files from HDFS into an S3 bucket and then loading them into HDFS again? Will this likely save me a bit of load time?
>>>
>>> Best,
>>>
>>> Colin Williams