hive-user mailing list archives

From Artem Ervits <dbis...@gmail.com>
Subject Re: Hive ExIm from on-premise HDP to Amazon EMR
Date Sun, 24 Jan 2016 23:06:51 GMT
Have you looked at Apache Falcon?
On Jan 8, 2016 2:41 AM, "Elliot West" <teabot@gmail.com> wrote:

> Further investigation appears to show this going wrong in a copy phase of
> the plan. The correctly functioning HDFS → HDFS import copy stage looks
> like this:
>
> STAGE PLANS:
>   Stage: Stage-1
>     Copy
>       source: hdfs://host:8020/staging/my_table/year_month=2015-12
>       destination:
> hdfs://host:8020/tmp/hive/hadoop/4f155e62-cec1-4b35-95e5-647ab5a74d3d/hive_2016-01-07_17-27-48_864_1838369633925145253-1/-ext-10000
>
>
> Whereas the S3 → S3 import copy stage shows an unexpected destination,
> which was presumably meant to be a temporary location on the source file
> system but is in fact simply the parent directory:
>
>
> STAGE PLANS:
>   Stage: Stage-1
>     Copy
>       source: s3n://exports-bucket/my_table/year_month=2015-12
>       destination: s3n://exports-bucket/my_table
>
>
> These stage plans were obtained using:
>
> EXPLAIN
> IMPORT FROM 'source'
> LOCATION 'destination';
>
>
> I'm beginning to think that this is a bug and not something I can work
> around, which is unfortunate as I'm not really in a position to deploy a
> fixed version in the short term. That said, if you confirm that this is not
> the intended behaviour, I'll raise a JIRA and possibly work on a fix.
>
> Thanks - Elliot.
>
>
> On 7 January 2016 at 16:53, Elliot West <teabot@gmail.com> wrote:
>
>> More information: This works if I move the export into EMR's HDFS and
>> then import from there to a new location in HDFS. It does not work across
>> FileSystems:
>>
>>    - Import from S3 → EMR HDFS (fails in a similar manner to S3 → S3)
>>    - Import from EMR HDFS → S3 (complains that HDFS FileSystem was
>>    expected as the destination. Presumably the same FileSystem instance
>>    is used for the source and destination).
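>> A minimal sketch of the HDFS-staging workaround described above (bucket
>> names and paths are illustrative; `hadoop distcp` is one way to stage the
>> export onto EMR's HDFS, `s3-dist-cp` on the EMR master is another):

```sql
-- 1. Stage the S3 export into EMR's HDFS first, e.g. from the EMR master:
--      hadoop distcp s3n://exports-bucket/my_table /staging/my_table
-- 2. Then import HDFS -> HDFS, which is the combination that works:
IMPORT FROM '/staging/my_table'
LOCATION '/warehouse/my_table';
```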
>>
>>
>>
>> On 7 January 2016 at 12:17, Elliot West <teabot@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Following on from my earlier post concerning syncing Hive data from an
>>> on-premise cluster to the cloud, I've been experimenting with the
>>> IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to
>>> Amazon EMR. I started out with some simple Exports/Imports as these can be
>>> the core operations on which replication is founded. This worked fine with
>>> some on-premise clusters running HDP-2.2.4.
>>>
>>>
>>> // on cluster 1
>>>
>>> EXPORT TABLE my_table PARTITION (year_month='2015-12')
>>> TO '/exports/my_table'
>>> FOR REPLICATION ('1');
>>>
>>> // Copy from cluster1:/exports/my_table to cluster2:/staging/my_table
>>>
>>> // on cluster 2
>>>
>>> IMPORT FROM '/staging/my_table'
>>> LOCATION '/warehouse/my_table';
>>>
>>> // Table created, partition created, data relocated to
>>> /warehouse/my_table/year_month=2015-12
>>>
>>>
>>> I next tried similar with HDP-2.2.4 → EMR (4.2.0) like so:
>>>
>>> // on-premise HDP-2.2.4
>>> SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;
>>>
>>> EXPORT TABLE my_table PARTITION (year_month='2015-12')
>>> TO 's3n://API_KEY:SECRET_KEY@exports-bucket/my_table';
>>>
>>> // on EMR
>>> SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;
>>>
>>> IMPORT FROM 's3n://exports-bucket/my_table'
>>> LOCATION 's3n://hive-warehouse-bucket/my_table';
>>>
>>>
>>> The IMPORT behaviour I see is bizarre:
>>>
>>>    1. Creates the folder 's3n://hive-warehouse-bucket/my_table'
>>>    2. Copies the part file from
>>>    's3n://exports-bucket/my_table/year_month=2015-12' to
>>>    's3n://exports-bucket/my_table' (i.e. to the parent)
>>>    3. Fails with: "ERROR exec.Task: Failed with exception checkPaths:
>>>    s3n://exports-bucket/my_table has nested
>>>    directorys3n://exports-bucket/my_table/year_month=2015-12"
>>>
>>> It is as if it is attempting to set the final partition location to
>>> 's3n://exports-bucket/my_table' and not
>>> 's3n://hive-warehouse-bucket/my_table/year_month=2015-12' as happens with
>>> HDP → HDP.
>>>
>>> I've tried variations, specifying the partition on import, excluding the
>>> location, all with the same result. Any thoughts or assistance would be
>>> appreciated.
>>>
>>> Thanks - Elliot.
>>>
>>>
>>>
>>
>
