hive-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject Re: Hive ExIm from on-premise HDP to Amazon EMR
Date Mon, 25 Jan 2016 08:09:48 GMT
Yes, we do use Falcon, but only a small fraction of the datasets we wish
to replicate are defined in this way. Could I perhaps just declare the
feeds in Falcon and not the processes that create them? Also, doesn't
Falcon use Hive ExIm/Replication to achieve this internally, and might I
therefore still encounter the same bug I am seeing now?

Thanks for your response.

On Sunday, 24 January 2016, Artem Ervits <dbist13@gmail.com> wrote:

> Have you looked at Apache Falcon?
> On Jan 8, 2016 2:41 AM, "Elliot West" <teabot@gmail.com> wrote:
>
>> Further investigation appears to show this going wrong in a copy phase of
>> the plan. The correctly functioning HDFS → HDFS import copy stage looks
>> like this:
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Copy
>>       source: hdfs://host:8020/staging/my_table/year_month=2015-12
>>       destination:
>> hdfs://host:8020/tmp/hive/hadoop/4f155e62-cec1-4b35-95e5-647ab5a74d3d/hive_2016-01-07_17-27-48_864_1838369633925145253-1/-ext-10000
>>
>>
>> Whereas the S3 → S3 import copy stage shows an unexpected destination,
>> which was presumably meant to be a temporary location on the source file
>> system (analogous to the .../-ext-10000 scratch path in the HDFS plan
>> above) but is in fact simply the export's parent directory:
>>
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Copy
>>       source: s3n://exports-bucket/my_table/year_month=2015-12
>>       destination: s3n://exports-bucket/my_table
>>
>>
>> These stage plans were obtained using:
>>
>> EXPLAIN
>> IMPORT FROM 'source'
>> LOCATION 'destination';
>>
>>
>> I'm beginning to think that this is a bug and not something I can work
>> around, which is unfortunate as I'm not really in a position to deploy a
>> fixed version in the short term. That said, if anyone can confirm that this
>> is not the intended behaviour, I'll raise a JIRA and possibly work on a fix.
>>
>> Thanks - Elliot.
>>
>>
>> On 7 January 2016 at 16:53, Elliot West <teabot@gmail.com> wrote:
>>
>>> More information: This works if I move the export into EMR's HDFS and
>>> then import from there to a new location in HDFS. It does not work across
>>> FileSystems:
>>>
>>>    - Import from S3 → EMR HDFS (fails in a similar manner to S3 → S3)
>>>    - Import from EMR HDFS → S3 (complains that HDFS FileSystem was
>>>    expected as the destination. Presumably the same FileSystem instance
>>>    is used for the source and destination).
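>>>
>>> For clarity, the sequence that does work on EMR is roughly the following
>>> (paths are placeholders; the copy into HDFS can be done with S3DistCp or
>>> similar):
>>>
>>> // copy the export from s3n://exports-bucket/my_table into EMR HDFS,
>>> // e.g. to /staging/my_table
>>>
>>> // then import HDFS → HDFS on EMR
>>> IMPORT FROM '/staging/my_table'
>>> LOCATION '/warehouse/my_table';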
>>>
>>>
>>>
>>> On 7 January 2016 at 12:17, Elliot West <teabot@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Following on from my earlier post about syncing Hive data from an
>>>> on-premise cluster to the cloud, I've been experimenting with the
>>>> IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to
>>>> Amazon EMR. I started out with some simple exports and imports, as these
>>>> would be the core operations on which replication is founded. This worked
>>>> fine between on-premise clusters running HDP-2.2.4.
>>>>
>>>>
>>>> // on cluster 1
>>>>
>>>> EXPORT TABLE my_table PARTITION (year_month='2015-12')
>>>> TO '/exports/my_table'
>>>> FOR REPLICATION ('1');
>>>>
>>>> // Copy from cluster1:/exports/my_table to cluster2:/staging/my_table (see note below)
>>>>
>>>> // on cluster 2
>>>>
>>>> IMPORT FROM '/staging/my_table'
>>>> LOCATION '/warehouse/my_table';
>>>>
>>>> // Table created, partition created, data relocated to
>>>> /warehouse/my_table/year_month=2015-12
>>>>
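>>>> The copy step above can be done with DistCp or similar; roughly, with
>>>> placeholder namenode hosts:
>>>>
>>>> hadoop distcp hdfs://cluster1-nn:8020/exports/my_table hdfs://cluster2-nn:8020/staging/my_table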
>>>>
>>>> I next tried similar with HDP-2.2.4 → EMR (4.2.0) like so:
>>>>
>>>> // On premise HDP2.2.4
>>>> SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;
>>>>
>>>> EXPORT TABLE my_table PARTITION (year_month='2015-12')
>>>> TO 's3n://API_KEY:SECRET_KEY@exports-bucket/my_table';
>>>>
>>>> // on EMR
>>>> SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;
>>>>
>>>> IMPORT FROM 's3n://exports-bucket/my_table'
>>>> LOCATION 's3n://hive-warehouse-bucket/my_table';
>>>>
>>>>
>>>> The IMPORT behaviour I see is bizarre:
>>>>
>>>>    1. Creates the folder 's3n://hive-warehouse-bucket/my_table'
>>>>    2. Copies the part file from
>>>>    's3n://exports-bucket/my_table/year_month=2015-12' to
>>>>    's3n://exports-bucket/my_table' (i.e. to the parent)
>>>>    3. Fails with: "ERROR exec.Task: Failed with exception checkPaths:
>>>>    s3n://exports-bucket/my_table has nested
>>>>    directorys3n://exports-bucket/my_table/year_month=2015-12"
>>>>
>>>> It is as if it is attempting to set the final partition location to
>>>> 's3n://exports-bucket/my_table' and not
>>>> 's3n://hive-warehouse-bucket/my_table/year_month=2015-12' as happens with
>>>> HDP → HDP.
>>>>
>>>> I've tried variations, specifying the partition on import, excluding
>>>> the location, all with the same result; one such variation is sketched
>>>> below. Any thoughts or assistance would be appreciated.
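>>>>
>>>> For reference, one variation, with the table and partition specified
>>>> explicitly (same paths as above), looked roughly like this:
>>>>
>>>> IMPORT TABLE my_table PARTITION (year_month='2015-12')
>>>> FROM 's3n://exports-bucket/my_table'
>>>> LOCATION 's3n://hive-warehouse-bucket/my_table';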
>>>>
>>>> Thanks - Elliot.
>>>>
>>>>
>>>>
>>>
>>
