hive-user mailing list archives

From W S Chung <>
Subject Re: load data unit of work
Date Wed, 15 Jun 2011 19:46:54 GMT
If that is the case, I'll just need to clean up the partially loaded HDFS
file in a background job. That should do.
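For reference, the cleanup step being discussed could look roughly like the following HiveQL sketch; the table name, partition key, and staging path are hypothetical, used only for illustration:

```sql
-- Hypothetical names: table `events`, partition key `dt`, staging path below.
-- First drop the partition from the metastore, in case it was ever registered:
ALTER TABLE events DROP PARTITION (dt='2011-06-15');
-- Then remove the stray files themselves from HDFS (shell, not HiveQL):
--   hadoop fs -rmr /warehouse/staging/dt=2011-06-15
```

Until both steps run, queries over `events` would simply not see the partition as long as it was never added to the metastore.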

On Wed, Jun 15, 2011 at 3:28 PM, Guy Bayes <> wrote:

> I think if you load a file, validate it, and then *alter table add
> partition* to the final table at the end, then in the event of a crash you only
> have a partially loaded ETL file that no one will be querying anyway.
> That should work, though I am not speaking from personal experience, at
> least not with Hive.
> Guy
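A minimal sketch of the load-validate-publish pattern Guy describes, with hypothetical table and path names:

```sql
-- 1. Copy the raw file to a staging location Hive does not yet expose
--    (shell): hadoop fs -put events.log /warehouse/staging/dt=2011-06-15/
-- 2. Validate the staged data (row counts, checksums) outside Hive.
-- 3. Only then publish it atomically through the metastore:
ALTER TABLE events ADD PARTITION (dt='2011-06-15')
LOCATION '/warehouse/staging/dt=2011-06-15';
```

Because the metastore update in step 3 is the only point at which queries can see the data, a crash during steps 1 or 2 leaves nothing visible to query.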
> On Wed, Jun 15, 2011 at 12:11 PM, W S Chung <> wrote:
>> If the load failure is severe enough, like the whole machine
>> crashing, there might not be an opportunity to catch the exception and
>> clean up the partition right away. The best I can think of is to clean up the
>> partition in a background job reasonably regularly. In that case, before the
>> cleanup, is there any way I can prevent any query from seeing the data in the
>> partition that should not be there?
>> Or will this really happen? If the metadata is only updated after a
>> successful load, the partition may not exist unless the load runs to
>> completion.
>> On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes <> wrote:
>>> The easiest way to achieve a level of robustness is probably to load into a
>>> partition and then truncate the partition in the event of failure.
>>> Cleaning up after an incomplete load is a problem in many traditional
>>> RDBMSs; you cannot always rely on rollback functionality.
>>> There are no explicit deletes in Hive, though, so whatever you need to do to
>>> massage and clean the data file is best done prior to inserting it into its
>>> final destination.
>>> Many of the things you bring up are more ETL best practices than
>>> properties of an RDBMS implementation, though.
>>>  Guy
>>> On Tue, Jun 14, 2011 at 8:57 AM, W S Chung <> wrote:
>>>> My question is a "what if" question, not a production issue. It seems
>>>> natural, when replacing a traditional database with Hive, to ask
>>>> how much robustness is sacrificed for scalability. My concern is that if
>>>> a file is partially loaded, there might not be an easy way to clean up the
>>>> already loaded data before re-loading it. The lack of unique indexes
>>>> does not make it easy to avoid duplicate data either, although
>>>> duplicated data can perhaps be deleted after the load.
>>>> On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek <> wrote:
>>>>> Hi,
>>>>> I think this is a problem with open source in general and sometimes it
>>>>> can be very frustrating.
>>>>> However, your question is more of a "what if" question - you're not in
>>>>> the position of having found a horrible bug after deploying to production,
>>>>> am I right?
>>>>> Regarding your question, I would guess that if LOAD DATA INPATH crashes
>>>>> while moving files into the Hive warehouse, the data which was moved
>>>>> appears as legitimately loaded data. Or the files will be moved but the
>>>>> metadata will not be updated. In either case, you should detect the crash
>>>>> and redo the operation. The easiest answer might actually be to look into
>>>>> the source code - sometimes the answer is easier to find than one would
>>>>> expect. Not a complete answer, but I hope this helps a bit.
>>>>> Martin
>>>>> On 14/06/2011 00:47, W S Chung wrote:
>>>>>> I submitted a question like this before, but somehow it was
>>>>>> never delivered. I cannot even find my question on Google. Since I cannot
>>>>>> find any admin e-mail/feedback form on the Hive website through which I
>>>>>> can ask why the last question was not delivered, there is not much option
>>>>>> other than to post the question again and hope that it gets through this
>>>>>> time. Apologies for the double posting if you have seen my last e-mail.
>>>>>> What is the behaviour if a client of Hive crashes in the middle of
>>>>>> running a "load data inpath" for either a local file or a file on HDFS?
>>>>>> Will the file be partially loaded in the db? Thanks.
