Subject: Re: Hive ExIm from on-premise HDP to Amazon EMR
From: Elliot West <teabot@gmail.com>
To: user@hive.apache.org
Date: Mon, 25 Jan 2016 08:09:48 +0000

Yes, we do use Falcon, but only a small fraction of the datasets we wish to replicate are defined in this way. Could I perhaps just declare the feeds in Falcon and not the processes that create them? Also, doesn't Falcon use Hive ExIm/Replication to achieve this internally, and might I therefore still encounter the same bug I am seeing now?

Thanks for your response.

On Sunday, 24 January 2016, Artem Ervits <dbist13@gmail.com> wrote:

Have you looked at Apache Falcon?

On Jan 8, 2016 2:41 AM, "Elliot West" <teabot@gmail.com> wrote:
Further investigation appears to show this going wrong in a copy phase of the plan. The correctly functioning HDFS → HDFS import copy stage looks like this:

STAGE PLANS:
  Stage: Stage-1
    Copy
      source: hdfs://host:8020/staging/my_table/year_month=2015-12
      destination: hdfs://host:8020/tmp/hive/hadoop/4f155e62-cec1-4b35-95e5-647ab5a74d3d/hive_2016-01-07_17-27-48_864_1838369633925145253-1/-ext-10000

Whereas the S3 → S3 import copy stage shows an unexpected destination, which was presumably meant to be a temporary location on the source file system but is in fact simply the parent directory:

STAGE PLANS:
  Stage: Stage-1
    Copy
      source: s3n://exports-bucket/my_table/year_month=2015-12
      destination: s3n://exports-bucket/my_table

These stage plans were obtained using:

EXPLAIN
IMPORT FROM 'source'
LOCATION 'destination';
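
(For the S3 case, a concrete instance of that template, using the bucket names that appear elsewhere in this thread, would be:)

EXPLAIN
IMPORT FROM 's3n://exports-bucket/my_table'
LOCATION 's3n://hive-warehouse-bucket/my_table';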
I'm beginning to think that this is a bug and not something I can work around, which is unfortunate as I'm not really in a position to deploy a fixed version in the short term. That said, if you confirm that this is not the intended behaviour, I'll raise a JIRA and possibly work on a fix.

Thanks - Elliot.


On 7 January 2016 at 16:53, Elliot West <teabot@gmail.com> wrote:
More information: This works if I move the export into EMR's HDFS and then import from there to a new location in HDFS (a sketch of that staging route follows the list below). It does not work across FileSystems:
  • Import from S3 → EMR HDFS (fails in a similar manner to S3 → S3)
  • Import from EMR HDFS → S3 (complains that HDFS FileSystem was expected as the destination. Presumably the same FileSystem instance is used for the source and destination).
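
As mentioned, staging through EMR HDFS does work. A minimal sketch of that route from the Hive CLI, assuming the cluster's S3 credentials are already configured (e.g. in core-site.xml) and using hypothetical staging/warehouse paths:

-- Pull the export into EMR HDFS first (the Hive CLI can run fs commands);
-- '/staging/my_table' and '/warehouse/my_table' are hypothetical paths:
dfs -cp s3n://exports-bucket/my_table /staging/my_table;

-- HDFS -> HDFS import then behaves as expected:
IMPORT FROM '/staging/my_table'
LOCATION '/warehouse/my_table';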


On 7 January 2016 at 12:17, Elliot West <teabot@gmail.com> wrote:
    Hello,

    Following on from my earlier post concerning syncing Hive data from an on-premise cluster to the cloud, I've been experimenting with the IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to Amazon EMR. I started out with some simple exports/imports, as these can be the core operations on which replication is founded. This worked fine with some on-premise clusters running HDP-2.2.4.

    // on cluster 1
    EXPORT TABLE my_table PARTITION (year_month='2015-12')
    TO '/exports/my_table'
    FOR REPLICATION ('1');

    // Copy from cluster1:/exports/my_table to cluster2:/staging/my_table

    // on cluster 2
    IMPORT FROM '/staging/my_table'
    LOCATION '/warehouse/my_table';

    // Table created, partition created, data relocated to /warehouse/my_table/year_month=2015-12
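
    For completeness, the inter-cluster copy step above can be performed with any HDFS-to-HDFS copy tool. A minimal sketch from the Hive CLI, with hypothetical namenode addresses and assuming compatible cluster versions (hadoop distcp being the more usual choice for large datasets):

    -- One way to perform the copy step; the cluster1/cluster2 addresses are hypothetical:
    dfs -cp hdfs://cluster1:8020/exports/my_table hdfs://cluster2:8020/staging/my_table;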

    I next tried similar with HDP-2.2.4 → EMR (4.2.0) like so:

    // On-premise HDP 2.2.4
    SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

    EXPORT TABLE my_table PARTITION (year_month='2015-12')
    TO 's3n://API_KEY:SECRET_KEY@exports-bucket/my_table'

    // on EMR
    SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

    IMPORT FROM 's3n://exports-bucket/my_table'
    LOCATION 's3n://hive-warehouse-bucket/my_table'
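
    Incidentally, the API_KEY/SECRET_KEY pair doesn't have to be embedded in the EXPORT URI. A sketch, assuming the standard Hadoop s3n credential properties are honoured by the session (placeholder values shown, which could equally live in core-site.xml):

    -- MY_ACCESS_KEY / MY_SECRET_KEY are hypothetical placeholders:
    SET fs.s3n.awsAccessKeyId=MY_ACCESS_KEY;
    SET fs.s3n.awsSecretAccessKey=MY_SECRET_KEY;

    EXPORT TABLE my_table PARTITION (year_month='2015-12')
    TO 's3n://exports-bucket/my_table';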

    The IMPORT behaviour I see is bizarre:
    1. Creates the folder 's3n://hive-warehouse-bucket/my_table'
    2. Copies the part file from 's3n://exports-bucket/my_table/year_month=2015-12' to 's3n://exports-bucket/my_table' (i.e. to the parent)
    3. Fails with: "ERROR exec.Task: Failed with exception checkPaths: s3n://exports-bucket/my_table has nested directory s3n://exports-bucket/my_table/year_month=2015-12"
    It is as if it is attempting to set the final partition location to 's3n://exports-bucket/my_table' and not 's3n://hive-warehouse-bucket/my_table/year_month=2015-12' as happens with HDP → HDP.

    I've tried variations, specifying the partition on import, excluding the location, all with the same result. Any thoughts or assistance would be appreciated.

    Thanks - Elliot.



