hive-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: LOAD DATA problem
Date Tue, 20 Mar 2012 13:03:50 GMT
By now you have all realized that the LOAD DATA semantics have
changed. I cannot find the exact issue, but here is a related change:


   * [HIVE-306] - Support "INSERT [INTO] destination"

I do not see a way out of this without code. Maybe you could code up a
hive query hook for this.
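
Short of a hook, one stopgap is a client-side guard that refuses to load a file whose name is already present in the partition directory. The following is only a sketch under stated assumptions: the function names and the `partition_dir` argument are hypothetical, the caller must know where the partition lives in HDFS, and the check-then-load sequence is not atomic. It also costs one extra HDFS call per file.

```python
import os
import posixpath
import subprocess


def hdfs_exists(path):
    """Return True if `path` exists in HDFS (`hadoop fs -test -e` exits 0)."""
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0


def guarded_load(local_file, table, partition_spec, partition_dir,
                 exists=hdfs_exists, run=subprocess.check_call):
    """Run LOAD DATA only if no same-named file is in `partition_dir`.

    `partition_dir` is the partition's HDFS directory (an assumption: the
    caller supplies it). Returns True if the load ran, False if skipped.
    """
    target = posixpath.join(partition_dir, os.path.basename(local_file))
    if exists(target):
        # Duplicate: report it to the caller instead of creating a _copy_N file.
        return False
    query = "LOAD DATA LOCAL INPATH '%s' INTO TABLE %s PARTITION(%s)" % (
        local_file, table, partition_spec)
    run(["hive", "-e", query])
    return True
```

The `exists` and `run` parameters are there so the logic can be exercised without a cluster. A caller can treat a `False` return as the "duplicate file" signal and ignore it.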

It definitely makes a good point that appending _copy_N after the .gz
is bad, since that will confuse the text input format, which relies on
the extension to choose a decompressor. I will open an issue on that.
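
To illustrate why the renamed file is read as raw bytes, here is a simplified model of extension-based codec selection. It is not Hadoop's actual code; the table and the function name `codec_for` are hypothetical, and only the general idea (the codec is picked by matching the end of the file name) is taken from the behavior described above.

```python
# Simplified suffix-to-codec table (illustrative, not Hadoop's own registry).
CODECS_BY_SUFFIX = {".gz": "gzip", ".bz2": "bzip2", ".deflate": "deflate"}


def codec_for(filename):
    """Return the codec whose suffix ends `filename`, or None if none matches."""
    for suffix, codec in CODECS_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return codec
    # No known suffix: the file would be read as plain, uncompressed text.
    return None


print(codec_for("events.gz"))         # matches the .gz suffix
print(codec_for("events.gz_copy_1"))  # no suffix matches
```

Under this model a name such as `events.gz_copy_1` no longer ends in a known suffix, so no decompressor is chosen, whereas a name that keeps its extension at the end (for example `test_b_copy_1.bz2`) would still resolve.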

On Tue, Mar 20, 2012 at 4:12 AM, Sean McNamara
<Sean.McNamara@webtrends.com> wrote:
> Gabi-
>
> Glad to know I'm not the only one scratching my head on this one!  The
> changed behavior caught us off guard.
>
> I haven't found a solution in my sleuthing tonight.  Indeed, any help would
> be greatly appreciated on this!
>
> Sean
>
> From: Gabi D <gabid33@gmail.com>
> Reply-To: <user@hive.apache.org>
> Date: Tue, 20 Mar 2012 10:03:04 +0200
> To: <user@hive.apache.org>
> Subject: Re: LOAD DATA problem
>
> Hi Vikas,
> we are facing the same problem that Sean reported and have also noticed that
> this behavior changed with a newer version of hive. Previously, when you
> inserted a file with the same name into a partition/table, hive would fail
> the request (with yet another of its cryptic messages, an issue in itself)
> while now it loads the file and appends _copy_N to the suffix.
> I have to say that, normally, we do not check for existence of a file with
> the same name in our hdfs directories. Our files arrive with unique names
> and if we try to insert the same file again it is because of some failure in
> one of the steps in our flow (e.g., files that were handled and loaded into
> hive have not been removed from our work directory for some reason, so in
> the next run of our load process they were reloaded). We do not want to add
> a step that checks whether a file with the same name already exists in hdfs
> - this is costly and most of the time (hopefully all of it) unnecessary.
> What we would like is to get some 'duplicate file' error and be able to
> disregard it, knowing that the file is already safely in its place.
> Note that having duplicate files causes us to double count rows, which is
> unacceptable for many applications.
> Moreover, we use gz files and since this behavior changes the suffix of the
> file (from gz to gz_copy_N) when this happens we seem to be getting all
> sorts of strange data since hadoop can't recognize that this is a zipped
> file and does not decompress it before reading it ...
> Any help or suggestions on this issue would be much appreciated, we have
> been unable to find any so far.
>
>
> On Tue, Mar 20, 2012 at 9:29 AM, hadoop hive <hadoophive@gmail.com> wrote:
>>
>> hey Sean,
>>
>> it's because you are appending the file to the same partition with the same
>> name (which is not possible). You must change the file name before appending
>> it to the same partition.
>>
>> AFAIK there is no other way to do that; you can change either the
>> partition name or the file name.
>>
>> Thanks
>> Vikas Srivastava
>>
>>
>> On Tue, Mar 20, 2012 at 6:45 AM, Sean McNamara
>> <Sean.McNamara@webtrends.com> wrote:
>>>
>>> Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1
>>> to logs that already exist in a partition?  If the log is already in
>>> hdfs/hive I'd rather it fail and give me a return code or output saying
>>> that the log already exists.
>>>
>>> For example, if I run these queries:
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO
>>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
>>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
>>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
>>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>>>
>>> I end up with:
>>> test_a.bz2
>>> test_b.bz2
>>> test_b_copy_1.bz2
>>> test_b_copy_2.bz2
>>>
>>> However, if I use OVERWRITE it will nuke all the data in the partition
>>> (including test_a.bz2) and I end up with just:
>>> test_b.bz2
>>>
>>> I recall that older versions of hive would not do this.  How do I handle
>>> this case?  Is there a safe atomic way to do this?
>>>
>>> Sean
>>>
>>
>
