From: "Owen O'Malley"
Date: Mon, 29 Jan 2018 14:28:48 -0800
Subject: Re: Questions regarding hive --orcfiledump or exporting orcfiles
To: user@orc.apache.org

There are some details, but fundamentally yes.

For non-partitioned tables, I'd probably distcp somewhere else and then use:

hive> load data inpath 'hdfs:/staging/' overwrite into table MyTable;

Assuming the table is backed by the same HDFS, it will do a hdfs mv to place the data in the right directory.
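
For example, the whole round trip could look roughly like this (the bucket name and the staging/warehouse paths below are just placeholders):

  # export: copy the table's files out of HDFS to S3
  hadoop distcp hdfs:/warehouse/mytable s3a://my-bucket/backup/mytable
  # import: copy them back into a staging directory, then hand them to Hive
  hadoop distcp s3a://my-bucket/backup/mytable hdfs:/staging/mytable
  hive> load data inpath 'hdfs:/staging/mytable' overwrite into table MyTable;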

Partitioned tables would need to be done with a load data command per partition.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
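
For a partitioned table it would be one command along these lines per partition (the staging path and the ds partition column here are made up for illustration):

  hive> load data inpath 'hdfs:/staging/mytable/ds=2018-01-29'
            overwrite into table MyTable partition (ds='2018-01-29');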

If you are using ACID tables, it is more complicated.

.. Owen

On Mon, Jan 29, 2018 at 2:12 PM, Colin Williams <colin.williams.seattle@gmail.com> wrote:

> Hi Owen,
>
> New to hive.
>
> Is the process as easy as
>
> # Export database
> 1) distcp a database hdfs path to s3
>
> # Import database
> 2) distcp the s3 database path back to an hdfs path
> 3) use CREATE TABLE statement from hive and set LOCATION as hdfs path?
>
> On Mon, Jan 29, 2018 at 1:55 PM, Owen O'Malley <owen.omalley@gmail.com> wrote:
>
>> My guess is that you should be able to save a fair amount of time by doing a byte copy rather than rewriting the ORC file.
>>
>> To get a distributed copy, you'd probably want to use distcp and then create the necessary tables and partitions for your Hive metastore.
>>
>> .. Owen
>>
>> On Mon, Jan 29, 2018 at 1:16 PM, Colin Williams <colin.williams.seattle@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Wasn't sure if I should ask here or on the Hive mailing list. We're creating External tables from an S3 bucket that contains some textfile records. Then we import these tables with STORED AS ORC.
>>>
>>> We have about 20 tables, and it takes a couple hours to create the tables. However currently we are just using a static data set.
>>>
>>> Then I'm wondering: can I reduce the load time by exporting the tables using hive --orcfiledump, or by just copying the files from HDFS into an S3 bucket and then loading them into HDFS again? Will this likely save me a bit of load time?
>>>
>>> Best,
>>>
>>> Colin Williams